Fixed query parameters not being considered for robots.txt evaluation
Toflar committed Dec 15, 2022
1 parent 9e7deb6 commit f73df70
Showing 7 changed files with 47 additions and 1 deletion.
13 changes: 12 additions & 1 deletion src/Subscriber/RobotsSubscriber.php
@@ -152,7 +152,7 @@ private function handleDisallowedByRobotsTxtTag(CrawlUri $crawlUri): void
         // Check if an URI is allowed by the robots.txt
         $inspector = new Inspector($robotsTxt, $this->escargot->getUserAgent());
 
-        if (!$inspector->isAllowed($crawlUri->getUri()->getPath())) {
+        if (!$inspector->isAllowed($this->getPathAndQuery($crawlUri->getUri()))) {
             $crawlUri->addTag(self::TAG_DISALLOWED_ROBOTS_TXT);
 
             $this->logWithCrawlUri(
@@ -273,4 +273,15 @@ private function extractUrisFromSitemap(CrawlUri $sitemapUri, string $content):
             $this->escargot->addUriToQueue($uri, $sitemapUri);
         }
     }
+
+    private function getPathAndQuery(UriInterface $uri): string
+    {
+        $path = $uri->getPath();
+
+        if ($query = $uri->getQuery()) {
+            $path .= '?'.$query;
+        }
+
+        return $path;
+    }
 }
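
For context, a minimal standalone sketch of why appending the query string matters (hypothetical illustration, not part of the commit; it assumes a PSR-7 Uri implementation such as nyholm/psr7 is available):

<?php
// Hypothetical illustration only, not part of this commit.
// Assumes a PSR-7 Uri implementation (e.g. nyholm/psr7) is installed.

use Nyholm\Psr7\Uri;

$uri = new Uri('https://www.terminal42.ch/foobar?with-a-query=true');

// Before this fix, only the path reached the robots.txt inspector:
$pathOnly = $uri->getPath(); // "/foobar"

// The new getPathAndQuery() helper also appends the query string,
// which is what rules such as "Disallow: /*?*" actually target:
$pathAndQuery = $uri->getPath();

if ($query = $uri->getQuery()) {
    $pathAndQuery .= '?'.$query; // "/foobar?with-a-query=true"
}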
11 changes: 11 additions & 0 deletions tests/EscargotTest.php
@@ -284,6 +284,17 @@ public function shouldRequest(CrawlUri $crawlUri): string
             }
         }
 
+        // Skip the links that are disallowed by robots.txt
+        if ($crawlUri->hasTag(RobotsSubscriber::TAG_DISALLOWED_ROBOTS_TXT)) {
+            $this->logWithCrawlUri(
+                $crawlUri,
+                LogLevel::DEBUG,
+                'Do not request because it was disallowed by the robots.txt.'
+            );
+
+            return SubscriberInterface::DECISION_NEGATIVE;
+        }
+
         // Skip rel="nofollow" links
         if ($crawlUri->hasTag(HtmlCrawlerSubscriber::TAG_REL_NOFOLLOW)) {
             $this->logWithCrawlUri(
1 change: 1 addition & 0 deletions tests/Fixtures/scenario18/_description.txt
@@ -0,0 +1 @@
Test robots.txt with query parameters are considered correctly
3 changes: 3 additions & 0 deletions tests/Fixtures/scenario18/_logs.txt
@@ -0,0 +1,3 @@
[Terminal42\Escargot\Subscriber\RobotsSubscriber] [URI: https://www.terminal42.ch/foobar?with-a-query=true (Level: 1, Processed: yes, Found on: https://www.terminal42.ch/, Tags: disallowed-robots-txt)] Added the "disallowed-robots-txt" tag because of the robots.txt content.
[class@anonymous:EscargotTest] [URI: https://www.terminal42.ch/foobar?with-a-query=true (Level: 1, Processed: yes, Found on: https://www.terminal42.ch/, Tags: disallowed-robots-txt)] Do not request because it was disallowed by the robots.txt.
[Terminal42\Escargot\Escargot] Finished crawling! Sent 1 request(s).
1 change: 1 addition & 0 deletions tests/Fixtures/scenario18/_requests.txt
@@ -0,0 +1 @@
Successful request! URI: https://www.terminal42.ch/ (Level: 0, Processed: yes, Found on: root, Tags: none).
7 changes: 7 additions & 0 deletions tests/Fixtures/scenario18/robots.txt
@@ -0,0 +1,7 @@
https://www.terminal42.ch/robots.txt

HTTP/2.0 200 OK
content-type: text/plain; charset=UTF-8

User-agent: *
Disallow: /*?*
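
The "Disallow: /*?*" rule in this fixture blocks every URI whose path-plus-query contains a question mark, i.e. any URL carrying query parameters. A hedged sketch of how this evaluates: the Inspector usage mirrors the RobotsSubscriber code above, while the Parser usage is an assumption about the webignition/robots-txt-file library that provides it.

<?php
// Sketch only: Inspector usage mirrors RobotsSubscriber above; the Parser
// class and its methods are assumed from the webignition/robots-txt-file
// library that supplies the Inspector.

use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource("User-agent: *\nDisallow: /*?*");

$inspector = new Inspector($parser->getFile(), 'escargot');

$inspector->isAllowed('/foobar');                   // true: no query string present
$inspector->isAllowed('/foobar?with-a-query=true'); // false: the query string matches /*?*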
12 changes: 12 additions & 0 deletions tests/Fixtures/scenario18/root.txt
@@ -0,0 +1,12 @@
https://www.terminal42.ch/

HTTP/2.0 200 OK
content-type: text/html; charset=UTF-8

<html>
<head>
</head>
<body>
<a href="https://www.terminal42.ch/foobar?with-a-query=true">Link</a>
</body>
</html>
