Robotparser returns wrong result?

Asked · Active 3 hr before · Viewed 126 times

6 Answers

90%

This module will take a URL, download and analyze it, and then return the technologies used by the website. Here is an example. Now, we can run the sitemap crawler to download all countries from the example website. In this section, we will take advantage of a weakness in the website structure to easily access all the content.

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
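To see how urllib.robotparser reads a file with these three sections, here is a minimal sketch. The robots.txt text is the example above, fed to parse() as a string instead of being fetched over HTTP, and the agent names BadCrawler / GoodCrawler are only illustrative:

import urllib.robotparser

# robots.txt content copied from the example above
robots_txt = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

print(rp.can_fetch("BadCrawler", "http://example.webscraping.com/"))       # False: section 1 blocks it everywhere
print(rp.can_fetch("GoodCrawler", "http://example.webscraping.com/"))      # True
print(rp.can_fetch("GoodCrawler", "http://example.webscraping.com/trap"))  # False: /trap is disallowed
print(rp.crawl_delay("GoodCrawler"))                                       # 5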
88%

A non-zero exit code is returned from Scrapy commands when an error happens on spider initialization (issue 3226). Exit codes are supported to signal errors in Scrapy commands (#248). The dictionaries in the result list of a media pipeline now include a new key, status, which indicates whether the file was downloaded or, if it was not downloaded, why not; see FilesPipeline.get_media_requests for more information (issue 2893, issue 4486). Stats now include the following entries that indicate the number of successes and failures in storing feeds:

feedexport/success_count/<storage type>
feedexport/failed_count/<storage type>
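As an illustration of how those counters could be consumed, below is a minimal sketch of a Scrapy extension that logs them when the spider closes. The class name and the FileFeedStorage suffix are my own illustrative choices; the actual <storage type> suffix depends on the feed storage backend in use.

from scrapy import signals

class FeedExportStatsLogger:
    """Log feed-export success/failure counters when the spider closes.

    Enable it through the EXTENSIONS setting, e.g.
    EXTENSIONS = {"myproject.extensions.FeedExportStatsLogger": 500}
    """

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # "FileFeedStorage" is only an example suffix; substitute the storage
        # backend your feed export actually uses.
        ok = self.stats.get_value("feedexport/success_count/FileFeedStorage", 0)
        failed = self.stats.get_value("feedexport/failed_count/FileFeedStorage", 0)
        spider.logger.info("Feed exports: %d succeeded, %d failed", ok, failed)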
72%

Sets the time the robots.txt file was last fetched to the current time.
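That sentence describes RobotFileParser.modified(); its counterpart mtime() returns the time the robots.txt file was last fetched. A minimal sketch, reusing the URL from the example below:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")

print(rp.mtime())  # 0 until the file has been fetched (or modified() is called)
rp.modified()      # record "now" as the last-fetched time
print(rp.mtime())  # a Unix timestamp such as 1700000000.0

In CPython's implementation, can_fetch() answers False for every URL until the file has actually been read, so forgetting to call read() (or parse()) first is a common reason robotparser seems to return the wrong result.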

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
65%

urllib.robotparser

# robots.txt for https://www.alza.cz/

User-Agent: *
Disallow: /Order1.htm
Disallow: /Order2.htm
Disallow: /Order3.htm
Disallow: /Order4.htm
Disallow: /Order5.htm
Disallow: /download/
Disallow: /muj-ucet/
Disallow: /Secure/
Disallow: /LostPassword.htm
Disallow: /search.htm*

Sitemap: https://www.alza.cz/_sitemap-categories.xml
Sitemap: https://www.alza.cz/_sitemap-categories-producers.xml
Sitemap: https://www.alza.cz/_sitemap-live-product.xml
Sitemap: https://www.alza.cz/_sitemap-dead-product.xml
Sitemap: https://www.alza.cz/_sitemap-before_listing.xml
Sitemap: https://www.alza.cz/_sitemap-seo-sorted-categories.xml
Sitemap: https://www.alza.cz/_sitemap-bazaar-categories.xml
Sitemap: https://www.alza.cz/_sitemap-sale-categories.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages.xml
Sitemap: https://www.alza.cz/_sitemap-parametrically-generated-pages-producer.xml
Sitemap: https://www.alza.cz/_sitemap-articles.xml
Sitemap: https://www.alza.cz/_sitemap-producers.xml
Sitemap: https://www.alza.cz/_sitemap-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-dead-econtent.xml
Sitemap: https://www.alza.cz/_sitemap-branch-categories.xml
Sitemap: https://www.alza.cz/_sitemap-installments.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-slots-of-accessories.xml
Sitemap: https://www.alza.cz/_sitemap-reviews.xml
Sitemap: https://www.alza.cz/_sitemap-detail-page-bazaar.xml
Sitemap: https://www.alza.cz/_sitemap-productgroups.xml
Sitemap: https://www.alza.cz/_sitemap-accessories.xml
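A minimal sketch of querying this file with urllib.robotparser; MyCrawler is a placeholder user-agent and site_maps() requires Python 3.8+. Because robotparser does simple prefix matching against the original robots.txt specification, wildcard rules such as Disallow: /search.htm* are not expanded, which is one way its answers can differ from what the site operator intended:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.alza.cz/robots.txt")
rp.read()  # fetch and parse the file shown above

print(rp.can_fetch("MyCrawler", "https://www.alza.cz/"))              # True
print(rp.can_fetch("MyCrawler", "https://www.alza.cz/Secure/login"))  # False: /Secure/ is disallowed

# The literal pattern "/search.htm*" is not treated as a wildcard, so this
# likely returns True even though the site intends to block search pages.
print(rp.can_fetch("MyCrawler", "https://www.alza.cz/search.htm?ft=laptop"))

print(rp.site_maps())  # list of the Sitemap URLs (Python 3.8+)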
75%

This class provides methods to read, parse and answer questions about the robots.txt file at url.

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
40%

Returns the value of the Crawl-delay parameter from robots.txt for the user agent in question.

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True