I'm writing a CrawlSpider to parse Google search results. The search query changes every run, so the spider must first connect to a database to collect the search queries it needs to parse. Here is my annotated CrawlSpider class:
from urllib import quote

from scrapy import log, signals
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item
from scrapy.selector import Selector
from scrapy.xlib.pydispatch import dispatcher
from twisted.internet import defer


class GoogleSpider(CrawlSpider):
    name = 'googlespider'
    allowed_domains = ['google.com', 'google.ca', 'google.fr']
    logger = log
    _google_query = "http://www.google.{0}/search?q={1}"

    def __init__(self, *args, **kwargs):
        super(GoogleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.get_startup_params, signals.spider_opened)

    @defer.inlineCallbacks
    def get_startup_params(self, spider, **kw):
        # Get the exact requests to issue to google
        exreqs = yield get_exactrequests()
        # Create the google query (i.e. url to scrape) and store
        # associated information
        start_urls = []
        self.item_lookup = {}
        for keyword, exact_request, lang in exreqs['res']:
            url = self.mk_google_query(lang, exact_request)
            start_urls.append(url)
            self.item_lookup[url] = (keyword, exact_request)
        # Assign the google query URLs to `start_urls`
        self.start_urls = tuple(start_urls)

    def mk_google_query(self, lang, search_terms):
        return self._google_query.format(lang, quote(search_terms))

    def parse_item(self, response):
        sel = Selector(response)
        item = Item()
        keyword, exact_request = self.item_lookup[response.request.url]
        item['urls'] = map(lambda r: r.extract(),
                           sel.xpath('//h3[@class="r"]/a/@href'))
        item['keyword'] = keyword
        item['exactrequest'] = exact_request
        return item
When I run scrapy crawl googlespider, I get a flood of log output that looks like this:
[-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100
(the same timestamped ERROR: fragment, repeated over and over with no error description)
This output goes on for (I estimate) 10,000 lines, far beyond my terminal's scrollback.
Does anyone know what the problem might be, and how I should diagnose/fix it?
Thanks!
Best Answer
Hard to say, since your log doesn't actually say anything, but here are some suggestions:
- The way you load start_urls seems unnecessarily complex. Scrapy already provides start_requests; if your URL generation needs extra work, you can override that method instead.
- Are you deliberately subclassing CrawlSpider? You don't seem to declare any rules, so I think you should subclass Spider instead.
Regarding "python - Why does scrapy dump thousands of `ERROR` log messages without any error description?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21818337/