python - Invoking Scrapy from a Python script never creates the JSON output file

Tags: python json web-crawler scrapy

Here is the Python script I use to invoke Scrapy; it is based on the answer to

Scrapy crawl from script always blocks script execution after scraping

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# import your MySpider class here

def stop_reactor():
    reactor.stop()

# shut the reactor down once the spider has finished
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')

Here is my pipelines.py code:

from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        # open the output file and start streaming items into it as JSON
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        # close the JSON array and release the file handle
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item

This pipeline code is taken from

Scrapy :: Issues with JSON export
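
For the pipeline to run at all, it has to be enabled in the project's settings.py. A minimal sketch, assuming the project is named scrapermar11 (a hypothetical name inferred from the pipeline class, adjust it to yours):

# settings.py -- 'scrapermar11' is a hypothetical project name
ITEM_PIPELINES = ['scrapermar11.pipelines.scrapermar11Pipeline']

# recent Scrapy releases expect a dict with an order value instead:
# ITEM_PIPELINES = {'scrapermar11.pipelines.scrapermar11Pipeline': 300}

When the crawler is started with scrapy crawl, these project settings are loaded automatically, which is why that mode produces the file.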

When I run the crawler like this:

scrapy crawl MySpider -a start_url='abc'

a links file with the expected output is created. But when I execute the Python script, no file is created, even though the Scrapy stats dumped while the crawler runs are similar to those of the earlier run. I think the bug is in the Python script, since the file is created with the first approach. How can I get the script to produce the output file?

Best Answer

This code works for me:

from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
# import your spider here

def handleSpiderIdle(spider):
    # stop the reactor when the spider goes idle (no more requests)
    reactor.stop()

mySettings = {'LOG_ENABLED': True,
              'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}

settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="")  # create a spider instance ourselves
crawlerProcess.crawl(spider)  # add it to the spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle)  # use this if you need to handle the idle event (restart the spider?)

log.start()  # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."

Regarding "python - Invoking Scrapy from a Python script never creates the JSON output file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15483898/
