python - Twisted failure when crawling a BBS with Scrapy

Tags: python web-scraping scrapy twisted

I am new to Python and Scrapy, and I wrote a simple script to crawl posts from my school's BBS. However, when my spider runs, it gets error messages like these:

2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Retrying <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A> (failed 2 times): [>]
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Gave up retrying <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A> (failed 3 times): [>]
2015-03-28 11:16:52+0800 [nju_spider] ERROR: Error downloading <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A>: [>]

2015-03-28 11:16:56+0800 [nju_spider] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 99,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 99,
 'downloader/request_bytes': 36236,
 'downloader/request_count': 113,
 'downloader/request_method_count/GET': 113,
 'downloader/response_bytes': 31135,
 'downloader/response_count': 14,
 'downloader/response_status_count/200': 14,
 'dupefilter/filtered': 25,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 3, 28, 3, 16, 56, 677065),
 'item_scraped_count': 11,
 'log_count/DEBUG': 127,
 'log_count/ERROR': 32,
 'log_count/INFO': 8,
 'request_depth_max': 3,
 'response_received_count': 14,
 'scheduler/dequeued': 113,
 'scheduler/dequeued/memory': 113,
 'scheduler/enqueued': 113,
 'scheduler/enqueued/memory': 113,
 'start_time': datetime.datetime(2015, 3, 28, 3, 16, 41, 874807)}
2015-03-28 11:16:56+0800 [nju_spider] INFO: Spider closed (finished)

It looks like the spider tried the URL and failed, but the URL really does exist. There are roughly thousands of posts in the BBS, yet every time I run my spider it only fetches a random handful of them. My code is below; any help is greatly appreciated.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ScrapyTest.items import NjuPostItem


class NjuSpider(CrawlSpider):
    name = 'nju_spider'
    allowed_domains = ['bbs.nju.edu.cn']
    start_urls = ['http://bbs.nju.edu.cn/bbstdoc?board=WarAndPeace']
    rules = [Rule(LinkExtractor(allow=[r'bbstcon\?board=WarAndPeace&file=M\.\d+\.A']),
              callback='parse_post'),
             Rule(LinkExtractor(allow=[r'bbstdoc\?board=WarAndPeace&start=\d+']),
              follow=True)]

    def parse_post(self, response):
        # self.log('A response from %s just arrived!' % response.url)
        post = NjuPostItem()
        post['url'] = response.url
        post['title'] = 'to_do'
        post['content'] = 'to_do'
        return post

Best Answer

First, make sure that your web-scraping approach does not violate the website's Terms of Use. Be a good web-scraping citizen.

Next, you can set the User-Agent header to pretend to be a browser. Provide a User-Agent in the DEFAULT_REQUEST_HEADERS setting:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
}

Alternatively, you can rotate user agents with a middleware. The implementation in the original answer was based on the fake-useragent package.

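The original snippet is not reproduced here. Below is a minimal sketch of such a downloader middleware; it substitutes a hardcoded User-Agent pool for the fake-useragent package so it is self-contained (the pool contents and the class name are illustrative, not from the original answer):

```python
import random

# Illustrative desktop User-Agent strings; in practice keep this pool fresh,
# or draw strings from the fake-useragent package instead.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36',
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a random User-Agent on every request."""

    def process_request(self, request, spider):
        # Scrapy's request.headers is dict-like, so plain assignment works
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```

To activate it, register the class in DOWNLOADER_MIDDLEWARES (under your own project path) and disable the built-in user-agent middleware so it does not overwrite the header.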

Another possible problem could be that you are hitting the website too frequently; consider tuning the DOWNLOAD_DELAY setting:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

There is another related setting that can have a positive impact: CONCURRENT_REQUESTS:

The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
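Putting the two throttling knobs together, a conservative settings.py fragment might look like the following (the numeric values are illustrative guesses, not taken from the original answer; tune them against the target site):

```python
# settings.py -- illustrative throttling values for a fragile target site

# Wait 2 seconds between consecutive requests to the same website
DOWNLOAD_DELAY = 2

# Reduce parallelism from Scrapy's default of 16 concurrent requests
CONCURRENT_REQUESTS = 4

# Optionally let Scrapy's AutoThrottle extension adapt the delay to load
AUTOTHROTTLE_ENABLED = True
```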

Regarding "python - Twisted failure when crawling a BBS with Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29313350/
