I am currently using Scrapy with multiprocessing. I built a proof of concept (POC) that runs many spiders. My code looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process


def worker(work_queue, done_queue):
    # Pull commands off the queue until the 'STOP' sentinel is reached.
    action = None
    try:
        for action in iter(work_queue.get, 'STOP'):
            status_code = run_spider(action)
    except Exception as e:
        done_queue.put("%s failed on %s with: %s" % (current_process().name, action, e))
    return True


def run_spider(action):
    # Run the shell command and return its exit status.
    return os.system(action)


def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )
    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []

    # Queue one crawl command per site.
    for action in sites:
        work_queue.put(action)

    # Start the workers and add one 'STOP' sentinel per worker.
    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')

    for p in processes:
        p.join()

    # Print any failure messages reported by the workers.
    done_queue.put('STOP')
    for status in iter(done_queue.get, 'STOP'):
        print(status)


if __name__ == '__main__':
    main()
In your opinion, what is the best way to run multiple Scrapy instances?
Is it better to launch one Scrapy instance per URL, or to launch one spider with x URLs (e.g. 1 spider with 100 links)?
Best answer
"Would it be better to launch a Scrapy instance for each URL, or to launch a spider with x URLs (e.g. 1 spider with 100 links)?"
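Launching one Scrapy instance per URL is definitely a bad choice, because you pay the overhead of Scrapy itself for every single URL.
I think it would be best to distribute the URLs evenly among the spiders.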
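One way to picture the even split is the minimal sketch below. It is not taken from the answer itself: the helper names split_evenly and build_command are made up for illustration, and it assumes the level1 spider has been changed to accept a comma-separated urls argument (the spider in the question only takes a single url).

```python
import shlex


def split_evenly(urls, n_spiders):
    """Split urls into at most n_spiders chunks whose sizes differ by at most one."""
    if n_spiders < 1:
        raise ValueError("n_spiders must be at least 1")
    base, extra = divmod(len(urls), n_spiders)
    chunks = []
    start = 0
    for i in range(n_spiders):
        # The first `extra` chunks each take one additional URL.
        size = base + (1 if i < extra else 0)
        if size:  # skip empty chunks when there are fewer URLs than spiders
            chunks.append(urls[start:start + size])
        start += size
    return chunks


def build_command(chunk, spider="level1"):
    # ASSUMPTION: the spider accepts a comma-separated `urls` argument.
    # The spider shown in the question only accepts a single `url`.
    return "scrapy crawl %s -a urls=%s" % (spider, shlex.quote(",".join(chunk)))


urls = ["https://www.example.com/test%d.html" % i for i in range(5)]
chunks = split_evenly(urls, 2)
commands = [build_command(c) for c in chunks]
```

The resulting commands could be put on the work queue from the question in place of the per-URL commands, so that each worker process starts Scrapy once per chunk instead of once per URL.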
Regarding "python - What is the best way to run multiple Scrapy instances with multiprocessing?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32007291/