I'm new to Python Scrapy and wrote a simple script to crawl posts from my school's BBS. However, when my spider runs, it gets error messages like these:
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Retrying http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A> (failed 2 times): [>]
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Gave up retrying http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A> (failed 3 times): [>]
2015-03-28 11:16:52+0800 [nju_spider] ERROR: Error downloading http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A>: [>]
2015-03-28 11:16:56+0800 [nju_spider] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 99,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 99,
 'downloader/request_bytes': 36236,
 'downloader/request_count': 113,
 'downloader/request_method_count/GET': 113,
 'downloader/response_bytes': 31135,
 'downloader/response_count': 14,
 'downloader/response_status_count/200': 14,
 'dupefilter/filtered': 25,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 3, 28, 3, 16, 56, 677065),
 'item_scraped_count': 11,
 'log_count/DEBUG': 127,
 'log_count/ERROR': 32,
 'log_count/INFO': 8,
 'request_depth_max': 3,
 'response_received_count': 14,
 'scheduler/dequeued': 113,
 'scheduler/dequeued/memory': 113,
 'scheduler/enqueued': 113,
 'scheduler/enqueued/memory': 113,
 'start_time': datetime.datetime(2015, 3, 28, 3, 16, 41, 874807)}
2015-03-28 11:16:56+0800 [nju_spider] INFO: Spider closed (finished)
It looks like the spider tried those URLs and failed, but the URLs do exist. There are roughly thousands of posts in the BBS, yet every time I run the spider it only fetches a random handful of them. My code is below; any help is much appreciated.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from ScrapyTest.items import NjuPostItem


class NjuSpider(CrawlSpider):
    name = 'nju_spider'
    allowed_domains = ['bbs.nju.edu.cn']
    start_urls = ['http://bbs.nju.edu.cn/bbstdoc?board=WarAndPeace']

    rules = [Rule(LinkExtractor(allow=['bbstcon\?board=WarAndPeace&file=M\.\d+\.A']),
                  callback='parse_post'),
             Rule(LinkExtractor(allow=['bbstdoc\?board=WarAndPeace&start=\d+']),
                  follow=True)]

    def parse_post(self, response):
        # self.log('A response from %s just arrived!' % response.url)
        post = NjuPostItem()
        post['url'] = response.url
        post['title'] = 'to_do'
        post['content'] = 'to_do'
        return post
Accepted answer
First of all, make sure your web-scraping approach does not violate the site's terms of use. Be a good web-scraping citizen.
Next, you can set the User-Agent header to pretend to be a browser. Provide a User-Agent in the DEFAULT_REQUEST_HEADERS setting:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
}
Alternatively, you can rotate user agents with a downloader middleware; one approach builds on the fake-useragent package:
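As a minimal sketch of such a middleware (the class name and the hardcoded User-Agent pool are assumptions made here for illustration; the original answer's version drew random strings from the fake-useragent package instead of a fixed list):

```python
import random

# Hypothetical User-Agent pool; with fake-useragent installed you would
# use UserAgent().random instead of choosing from a hardcoded list.
USER_AGENT_POOL = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0',
]


class RotateUserAgentMiddleware(object):
    """Downloader middleware that assigns a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request.
        request.headers['User-Agent'] = random.choice(USER_AGENT_POOL)
        return None  # returning None lets Scrapy continue processing
```

To activate it, the middleware would be registered in DOWNLOADER_MIDDLEWARES in settings.py (path depends on where you place the class), typically while disabling Scrapy's built-in UserAgentMiddleware.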
Another possible problem could be that you are hitting the site too frequently; consider tuning the DOWNLOAD_DELAY setting:
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
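For example, in settings.py (the delay value here is illustrative, not prescribed by the original answer; tune it to what the target site tolerates):

```python
# settings.py -- throttle the crawl (values are illustrative)
DOWNLOAD_DELAY = 2               # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x
```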
There is another related setting that can have a positive impact: CONCURRENT_REQUESTS:
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
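A sketch of how this might look in settings.py (the specific numbers are assumptions; Scrapy's defaults are considerably higher, so lowering them makes the crawl gentler):

```python
# settings.py -- limit concurrency (values are illustrative)
CONCURRENT_REQUESTS = 4             # total simultaneous requests (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap parallel requests per domain
```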
Regarding "python - Twisted failure when crawling a bbs with scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29313350/