python 2.7 - How to solve 403 error in Scrapy
I'm new to Scrapy and I made a Scrapy project to scrape data.
I'm trying to scrape data from a website, but I'm getting the following error logs:
2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines: []
2016-08-29 13:55:03 [scrapy] INFO: Spider opened
2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/mumbai/small-business> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Ignoring response <403 http://www.justdial.com/mumbai/small-business>: HTTP status code is not handled or not allowed
2016-08-29 13:55:04 [scrapy] INFO: Closing spider (finished)
I tried the following commands in the website's browser console and got a response, but when I use the same XPaths inside my Python script I get the error described above.
Commands on the web console:
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingr0"]/h4/span/a/text()')
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingr0"]/p[@class="contact-info"]/span/a/text()')
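Inside my Python script the spider uses the same XPaths, roughly like this (a simplified sketch; the parse callback and variable names are just illustrative):

def parse(self, response):
    # Same XPaths that work in the browser console
    names = response.xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingr0"]/h4/span/a/text()').extract()
    contacts = response.xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingr0"]/p[@class="contact-info"]/span/a/text()').extract()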
Please help me.
Thanks
Like Avihoo Mamka mentioned in the comment, you need to provide some extra request headers so you are not rejected by this website.
In this case it seems to just be the User-Agent header. By default Scrapy identifies itself with the user agent "Scrapy/{version}(+http://scrapy.org)". Some websites might reject this for one reason or another.
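If you want to change it for the whole project rather than per request, you can also override the USER_AGENT setting in your project's settings.py (a minimal sketch; the Firefox string is just one common example):

# settings.py -- project-wide override of Scrapy's default user agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'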
To avoid that, set the headers parameter of your Request with a common user agent string:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
yield Request(url, headers=headers)
You can find a huge list of user agents here, though you should stick with popular web browser ones like Firefox, Chrome etc. for the best results.
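If you want to vary the user agent between requests, one simple approach is to pick one at random from such a list in start_requests (just a sketch; the USER_AGENTS list here is illustrative):

import random
from scrapy import Request

# A few common desktop browser user agents (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
]

# Inside your spider class:
def start_requests(self):
    for url in self.start_urls:
        # Send each request with a randomly chosen user agent
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        yield Request(url, headers=headers)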
You can implement this to work with your spider's start_urls too:
import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'http://scrapy.org',
    )

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)
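Then run the spider as usual:

scrapy crawl myspider

The requests will now go out with the Firefox user agent instead of Scrapy's default one.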