python 2.7 - How to solve a 403 error in Scrapy


I'm new to Scrapy, and I made a Scrapy project to scrape data.

I'm trying to scrape data from a website, but I'm getting the following error logs:

    2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines: []
    2016-08-29 13:55:03 [scrapy] INFO: Spider opened
    2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
    2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/mumbai/small-business> (referer: None)
    2016-08-29 13:55:04 [scrapy] DEBUG: Ignoring response <403 http://www.justdial.com/mumbai/small-business>: HTTP status code is not handled or not allowed
    2016-08-29 13:55:04 [scrapy] INFO: Closing spider (finished)

When I try the following commands in the browser's web console I get a response, but when I use the same XPath inside the Python script I get the error described above.

Commands in the web console:

    $x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingr0"]/h4/span/a/text()')
    $x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingr0"]/p[@class="contact-info"]/span/a/text()')

Please help me.

Thanks

Like Avihoo Mamka mentioned in the comment, you need to provide some request headers so that you are not rejected by this website.

In this case it seems to be just the User-Agent header. By default Scrapy identifies itself with the user agent "Scrapy/{version} (+http://scrapy.org)". Some websites might reject this for one reason or another.

To avoid this, set the headers parameter of your Request to a common user agent string:

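    from scrapy import Request

    # Inside a spider callback: send a browser-like User-Agent with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    yield Request(url, headers=headers)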

Huge lists of user-agent strings are available online, though you should stick to the ones from popular web browsers such as Firefox or Chrome for the best results.

You can implement this to work with your spider's start_urls too:

    import scrapy
    from scrapy import Request


    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = (
            'http://scrapy.org',
        )

        def start_requests(self):
            # Attach a browser-like User-Agent to every initial request.
            headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
            for url in self.start_urls:
                yield Request(url, headers=headers)
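If you would rather not pass headers on every request, one alternative is to set the user agent once for the whole project through Scrapy's USER_AGENT setting. The fragment below is only a sketch: it assumes a standard project layout with a settings.py file and reuses the Firefox string from above, which you may want to swap for a more recent one. As far as I can tell, this default should also apply to the robots.txt request that returned 403 in the log, since that request is not created by your spider code.

```python
# settings.py (project-wide Scrapy settings)
# Assumption: the same Firefox user-agent string used in the answer above;
# replace it with whichever browser string works for the target site.
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'
```

The same key can also be placed in a single spider's custom_settings dict if only one spider needs it. Headers passed explicitly to a Request still take precedence over this default.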
