This is the Python script I use to invoke Scrapy, taken from the answer to "Scrapy crawl from script always blocks script execution after scraping":
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# MySpider is the project's spider, defined/imported elsewhere

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
This is my pipelines.py code:
from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item
This code was taken from here:
Scrapy :: Issues with JSON export
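For this pipeline to run at all, it has to be enabled in the project's settings.py. A minimal sketch, assuming the project is named scrapermar11 (the project name is a guess based on the pipeline's class name; Scrapy 0.x used a list here):

# settings.py -- 'scrapermar11' is an assumed project name
ITEM_PIPELINES = ['scrapermar11.pipelines.scrapermar11Pipeline']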
When I run the crawler like this:
scrapy crawl MySpider -a start_url='abc'
a links file with the expected output is created. But when I run the Python script, it doesn't create any file, even though the Scrapy stats dumped while the crawler runs are similar to those of the previous run. I think there is a mistake in the Python script, because the file does get created with the first approach. How can I make the script write the output file?
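The likely cause, consistent with the accepted answer below, is that Crawler(Settings()) builds the crawler from default settings: nothing in the script loads the project's settings.py, so ITEM_PIPELINES stays empty and the export pipeline never runs. A quick way to check, as a sketch against the script above:

from scrapy.settings import Settings
# With bare defaults the pipeline list is empty, so no JSON file is written
print Settings().get('ITEM_PIPELINES')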
Best answer
This code works for me:
from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

# enable the export pipeline explicitly, since the script does not load settings.py
mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="")  # create a spider ourselves
crawlerProcess.crawl(spider)  # add it to the spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle)  # use this if you need to handle the idle event (restart the spider?)

log.start()  # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
Regarding "python - Calling Scrapy from a Python script that never creates the JSON output file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15483898/