I'm experimenting with Scrapy and running into a bit of difficulty. I want this script to run the callback.
import scrapy
from scrapy.spiders import Spider

class ASpider(Spider):
    name = 'myspider'
    allowed_domains = ['wikipedia.org', 'en.wikipedia.org']
    start_urls = ['https://www.wikipedia.org/']

    def parse(self, response):
        urls = response.css("a::attr('href')").extract()
        for url in urls:
            url = response.urljoin(url)
            print("url\t", url)
            scrapy.Request(url, callback=self.my_callback)

    def my_callback(self, response):
        print("callback called")
The output from running it:
2016-05-31 16:21:26 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-05-31 16:21:26 [scrapy] INFO: Overridden settings: {}
2016-05-31 16:21:26 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2016-05-31 16:21:26 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-05-31 16:21:26 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-05-31 16:21:26 [scrapy] INFO: Enabled item pipelines:
[]
2016-05-31 16:21:26 [scrapy] INFO: Spider opened
2016-05-31 16:21:26 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-31 16:21:27 [scrapy] DEBUG: Crawled (200) <GET https://www.wikipedia.org/> (referer: None)
url https://en.wikipedia.org/
url https://es.wikipedia.org/
url https://ja.wikipedia.org/
(Long list of similar urls)
url https://meta.wikimedia.org/
2016-05-31 16:21:27 [scrapy] INFO: Closing spider (finished)
2016-05-31 16:21:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 215,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 18176,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 31, 14, 21, 27, 240038),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 31, 14, 21, 26, 328888)}
2016-05-31 16:21:27 [scrapy] INFO: Spider closed (finished)
It never runs the callback. Why is that, and what needs to change to make the callback work?
Best Answer
A spider callback has to yield (or return) a Request, an Item, or (I believe) a dict. As the Scrapy documentation puts it:
In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
Regarding "python - scrapy callback in parse not called", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/37548479/