python - Running all of the spiders in a Scrapy project locally

Tags: python, web-crawler, scrapy

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code has changed quite a bit since.

I tried creating my own command:

from scrapy.command import ScrapyCommand
from scrapy.utils.misc import load_object
from scrapy.conf import settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spman_cls = load_object(settings['SPIDER_MANAGER_CLASS'])
        spiders = spman_cls.from_settings(settings)

        for spider_name in spiders.list():
            spider = self.crawler.spiders.create(spider_name)
            self.crawler.crawl(spider)

        self.crawler.start()

But as soon as a spider is registered with self.crawler.crawl(), I get assertion errors for all of the other spiders:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/home/blender/Projects/scrapers/store_crawler/store_crawler/commands/crawlall.py", line 22, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/site-packages/scrapy/crawler.py", line 47, in crawl
    return self.engine.open_spider(spider, requests)
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1214, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1071, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/site-packages/scrapy/core/engine.py", line 215, in open_spider
    spider.name
exceptions.AssertionError: No free spider slots when opening 'spidername'

Is there some way to do this? I'd rather not start subclassing core Scrapy components just to run all of my spiders like this.

Best answer

Why don't you just use something like:

scrapy list|xargs -n 1 scrapy crawl

?
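Here scrapy list prints one spider name per line, and xargs -n 1 scrapy crawl invokes scrapy crawl once for each of those names, so every spider in the project is run in its own process. If you would rather stay inside Python, a minimal sketch along these lines should work on newer Scrapy releases that provide CrawlerProcess and a spider loader (note this is a different API from the older SPIDER_MANAGER_CLASS interface used in the question, so treat it as an assumption about your Scrapy version):

    # Sketch only: assumes a newer Scrapy (1.x or later) with CrawlerProcess.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_all_spiders():
        settings = get_project_settings()   # load the project's settings.py
        process = CrawlerProcess(settings)  # manages the Twisted reactor for us

        # Schedule every spider registered in the project, then start the reactor.
        for spider_name in process.spider_loader.list():
            process.crawl(spider_name)

        process.start()  # blocks until all scheduled crawls have finished

    if __name__ == '__main__':
        run_all_spiders()

Unlike the xargs one-liner, this runs all of the spiders in a single process, which matters if you want them to share settings, extensions, or a pipeline instance.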

On the topic of running all of the spiders in a Scrapy project locally, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15564844/
