python - Scrapy returns a 403 (Forbidden) error

Tags: python scrapy web-crawler http-status-code-403

I am new to Scrapy and to Python in general. In the past I managed to get a minimal Scrapy example working, but I haven't used it since. In the meantime a new version has been released (I believe the last version I used was 0.24), and no matter what I try, I cannot figure out why I get a 403 error for the site I am trying to crawl.

Admittedly, I haven't dug into middlewares and/or pipelines yet, but I was hoping to get a minimal example running before exploring further. That said, here is my current code:

items.py

import scrapy

class StackItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

stack_spider.py

#derived from https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/
from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem

class StackSpider(Spider):
    handle_httpstatus_list = [403, 404]  # added out of desperation. Is it serving any purpose?
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            self.log(question)
            item = StackItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            yield item

Output

(pyplayground) 22:39 ~/stack $ scrapy crawl stack                                                                                                                             
2016-03-07 22:39:38 [scrapy] INFO: Scrapy 1.0.5 started (bot: stack)                                                                                                          
2016-03-07 22:39:38 [scrapy] INFO: Optional features available: ssl, http11                                                                                                   
2016-03-07 22:39:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'stack', 'RET
RY_HTTP_CODES': [500, 502, 503, 504, 400, 403, 404, 408], 'DOWNLOAD_DELAY': 3}                                                                                                
2016-03-07 22:39:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState                                                           
2016-03-07 22:39:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddlewa
re, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats                  
2016-03-07 22:39:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware                
2016-03-07 22:39:39 [scrapy] INFO: Enabled item pipelines:                                                                                                                    
2016-03-07 22:39:39 [scrapy] INFO: Spider opened                                                                                                                              
2016-03-07 22:39:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)                                                                         
2016-03-07 22:39:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023                                                                                                
2016-03-07 22:39:39 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 1 times): 403 Forbidden                                 
2016-03-07 22:39:42 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 2 times): 403 Forbidden                                 
2016-03-07 22:39:47 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 3 times): 403 Forbidden                                 
2016-03-07 22:39:51 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 4 times): 403 Forbidden                                 
2016-03-07 22:39:55 [scrapy] DEBUG: Retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 5 times): 403 Forbidden                                 
2016-03-07 22:39:58 [scrapy] DEBUG: Gave up retrying <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (failed 6 times): 403 Forbidden                         
2016-03-07 22:39:58 [scrapy] DEBUG: Crawled (403) <GET http://stackoverflow.com/questions?pagesize=50&sort=newest> (referer: None)                                            
2016-03-07 22:39:58 [scrapy] INFO: Closing spider (finished)                                                                                                                  
2016-03-07 22:39:58 [scrapy] INFO: Dumping Scrapy stats:                                                                                                                      
{'downloader/request_bytes': 1488,                                                                                                                                            
 'downloader/request_count': 6,                                                                                                                                               
 'downloader/request_method_count/GET': 6,                                                                                                                                    
 'downloader/response_bytes': 6624,                                                                                                                                           
 'downloader/response_count': 6,                                                                                                                                              
 'downloader/response_status_count/403': 6,                                                                                                                                   
 'finish_reason': 'finished',                                                                                                                                                 
 'finish_time': datetime.datetime(2016, 3, 7, 22, 39, 58, 458578),                                                                                                            
 'log_count/DEBUG': 8,                                                                                                                                                        
 'log_count/INFO': 7,                                                                                                                                                         
 'response_received_count': 1,                                                                                                                                                
 'scheduler/dequeued': 6,                                                                                                                                                     
 'scheduler/dequeued/memory': 6,                                                                                                                                              
 'scheduler/enqueued': 6,                                                                                                                                                     
 'scheduler/enqueued/memory': 6,                                                                                                                                              
 'start_time': datetime.datetime(2016, 3, 7, 22, 39, 39, 607472)}                                                                                                             
2016-03-07 22:39:58 [scrapy] INFO: Spider closed (finished) 

Best answer

You are almost certainly behind a proxy. Check your http_proxy and https_proxy environment variables and set them appropriately. Cross-check by verifying that curl can fetch that URL from the terminal.
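Scrapy's built-in HttpProxyMiddleware (visible in the "Enabled downloader middlewares" line of the log above) picks up its proxy configuration from the standard environment variables at startup. A minimal sketch of a check you can run before crawling — the helper name `proxy_settings` is hypothetical, for illustration only:

```python
import os

def proxy_settings(environ=os.environ):
    """Return the (http, https) proxy values that Scrapy's
    HttpProxyMiddleware would typically pick up from the environment.
    Lowercase variables are checked first, then their uppercase forms."""
    http = environ.get("http_proxy") or environ.get("HTTP_PROXY")
    https = environ.get("https_proxy") or environ.get("HTTPS_PROXY")
    return http, https

if __name__ == "__main__":
    http, https = proxy_settings()
    print("http_proxy: ", http)
    print("https_proxy:", https)
```

If these come back empty but your network requires a proxy, export the variables in the shell before running `scrapy crawl stack`. Conversely, if curl fetches the page fine without any proxy, the 403 may instead be the site rejecting the request itself (for example, its default User-Agent), which is worth ruling out separately.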

Regarding python - Scrapy returns a 403 (Forbidden) error, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35855920/
