I am running a spider in a Scrapy project from a script file, and the spider is logging the crawler output/results. But I want to use the spider output/results inside a function in that script file. I don't want to save the output/results to any file or database.
Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script :
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
def spider_output(output):
    # do something with that output
How can I get the spider output/results inside the spider_output function? Is it possible to get the output/results this way?
Best Answer
Here is a solution that collects all of the output/results in a list:
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher
def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider must be imported or defined in this script
    process.start()  # the script will block here until the crawling is finished
    return results
if __name__ == '__main__':
    print(spider_results())
Regarding "python - Get Scrapy crawler output/results in a script file function", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40237952/