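python - What is the best way to run multiple Scrapy spiders with multiprocessing?

I am currently using Scrapy together with multiprocessing. I built a proof of concept that runs many spiders. My code looks like this: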

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process

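def worker(work_queue, done_queue):
    # Pull commands off the queue until the 'STOP' sentinel arrives.
    for action in iter(work_queue.get, 'STOP'):
        try:
            status_code = run_spider(action)
            if status_code:
                done_queue.put("%s: %s exited with status %s" % (current_process().name, action, status_code))
        except Exception as e:
            done_queue.put("%s failed on %s with: %s" % (current_process().name, action, e))
    return True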


def run_spider(action):
    # os.system returns a non-zero status if the command failed.
    return os.system(action)

def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )

    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []

    for action in sites:
        work_queue.put(action)

    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        # One 'STOP' sentinel per worker so every worker loop terminates.
        work_queue.put('STOP')

    for p in processes:
        p.join()

    done_queue.put('STOP')

    for status in iter(done_queue.get, 'STOP'):
        print(status)

if __name__ == '__main__':
    main()

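In your opinion, what is the best solution for running multiple Scrapy instances?

Would it be better to launch one Scrapy instance per URL, or to launch one spider with x URLs (e.g. 1 spider with 100 links)?

Best answer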

"Would it be better to launch one Scrapy instance per URL, or to launch one spider with x URLs (e.g. 1 spider with 100 links)?"

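Launching one Scrapy instance per URL is definitely a bad choice, because you pay the overhead of Scrapy itself for every single URL.

I think it would be best to distribute the URLs evenly among the spiders.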
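
As a rough sketch of that idea, the URLs could be split into one chunk per worker, with each chunk handed to a single spider run. The helper names `split_evenly` and `build_commands` are made up for illustration, and the comma-separated `urls` spider argument is an assumption: the `level1` spider in the question takes a single `url`, so it would need to be adapted to parse a list.

```python
import shlex


def split_evenly(urls, n_workers):
    """Split urls round-robin into at most n_workers chunks.

    Chunk sizes differ by at most one; empty chunks are dropped.
    """
    chunks = [urls[i::n_workers] for i in range(n_workers)]
    return [chunk for chunk in chunks if chunk]


def build_commands(urls, n_workers, spider="level1"):
    # ASSUMPTION: the spider accepts a comma-separated `urls` argument.
    # This is hypothetical; the spider in the question only takes `url`.
    return [
        "scrapy crawl {} -a urls={}".format(spider, shlex.quote(",".join(chunk)))
        for chunk in split_evenly(urls, n_workers)
    ]


if __name__ == "__main__":
    urls = ["https://www.example.com/test%d.html" % i for i in range(5)]
    for command in build_commands(urls, 2):
        print(command)
```

Each resulting command can then be put on the work queue exactly like the per-URL commands in the question, so Scrapy starts once per chunk instead of once per URL.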

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32007291/