I have the following setup: a 'collect' management command (collect_positions.py) -> a Celery task (tasks.py) -> a Scrapy spider (MySpider)...
collect_positions.py:
from django.core.management.base import BaseCommand
from tracker.models import Keyword
from tracker.tasks import positions


class Command(BaseCommand):
    help = 'collect_positions'

    def handle(self, *args, **options):
        def chunks(l, n):
            """Yield successive n-sized chunks from l."""
            for i in range(0, len(l), n):
                yield l[i:i + n]

        chunk_size = 1
        # 'product' is resolved elsewhere in the original code
        keywords = Keyword.objects.filter(product=product).values_list('id', flat=True)
        chunks_list = list(chunks(keywords, chunk_size))
        positions.chunks(chunks_list, 1).apply_async(queue='collect_positions')
        return 0
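As a side note, the `chunks` helper above just splits the list of keyword IDs into fixed-size slices before the Celery subtasks are queued. A standalone sketch of its behavior (the IDs here are made up for illustration):

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# With chunk_size=1, every keyword ID lands in its own chunk,
# so each Celery subtask crawls exactly one keyword:
keyword_ids = [10, 11, 12]  # hypothetical IDs
print(list(chunks(keyword_ids, 1)))  # [[10], [11], [12]]
print(list(chunks(keyword_ids, 2)))  # [[10, 11], [12]]
```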
tasks.py:
from app_name.celery import app
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword
from scrapy.crawler import CrawlerProcess


@app.task
def positions(*args):
    s = Settings()
    s.setmodule(scrapy_settings)
    keywords = Keyword.objects.filter(id__in=list(args))
    process = CrawlerProcess(s)
    process.crawl(MySpider, keywords_chunk=keywords)
    process.start()
    return 1
I run the command from the command line, which creates the tasks for parsing. The first queue finishes successfully, but the subsequent ones fail with the error:
twisted.internet.error.ReactorNotRestartable
How can I fix this error? I can provide any additional details if needed...
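For context: Twisted's reactor can be started only once per process, so the second task that calls process.start() inside the same long-lived Celery worker process hits ReactorNotRestartable. One common workaround (not the one in the accepted answer below, which uses crochet instead) is to run each crawl in a fresh child process, so every crawl gets a brand-new reactor. A minimal stdlib sketch of that pattern, with a stub standing in for the actual CrawlerProcess call:

```python
import multiprocessing

# fork: the child inherits these functions directly (Linux/macOS)
_ctx = multiprocessing.get_context("fork")

def run_crawl(result_queue, keyword_ids):
    # In the real task this would build the CrawlerProcess and call
    # process.start(); a stub stands in for the actual crawl here.
    result_queue.put("crawled %s" % keyword_ids)

def positions(keyword_ids):
    # Every call spawns a fresh child process, so the Twisted reactor
    # (started inside run_crawl in the real version) is new each time.
    q = _ctx.Queue()
    p = _ctx.Process(target=run_crawl, args=(q, keyword_ids))
    p.start()
    result = q.get()
    p.join()
    return result

print(positions([1]))  # crawled [1]
print(positions([2]))  # crawled [2] -- a second run works: new process, new reactor
```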
Update 1
Thanks for the answer, @Chiefir! I managed to get all the queues running, but only the start_requests() method is invoked, and parse() never runs.
The main methods of the Scrapy spider:
def start_requests(self):
    print('STEP1')
    yield scrapy.Request(
        url='example.com',
        callback=self.parse,
        errback=self.error_callback,
        dont_filter=True
    )

def error_callback(self, failure):
    # requires: from scrapy.spidermiddlewares.httperror import HttpError
    #           from twisted.internet.error import DNSLookupError, TimeoutError
    print(failure)
    # log all errback failures;
    # in case you want to do something special for some errors,
    # you may need the failure's type
    print(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        print('HttpError on %s', response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        print('DNSLookupError on %s', request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        print('TimeoutError on %s', request.url)

def parse(self, response):
    print('STEP2', response)
In the console I only get:
STEP1
What could be the reason?
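One thing worth checking, as an aside (it is not part of the accepted answer below): the url in start_requests above has no scheme, and Scrapy rejects scheme-less URLs with ValueError: Missing scheme in request url when the Request object is built, which by itself would explain why parse() never runs after STEP1 is printed. A quick stdlib check of the same condition:

```python
from urllib.parse import urlparse

url = "example.com"          # a scheme-less URL like the one in start_requests
print(urlparse(url).scheme)  # '' -- empty: there is no scheme

# Scrapy performs essentially this check when a Request is built;
# an empty scheme means the request is rejected before any download:
def has_scheme(u):
    return bool(urlparse(u).scheme)

print(has_scheme("example.com"))          # False
print(has_scheme("https://example.com"))  # True
```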
Best Answer
This is an age-old problem:
This is what helped me win the battle against the ReactorNotRestartable error: last answer from the author of the question
0) pip install crochet
1) from crochet import setup
2) setup() - at the top of the file
3) remove these 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error and spent 4+ hours trying to solve it, reading every question about it here. Finally found the solution - and I'm sharing it. This is how I solved it. The only meaningful lines left from the Scrapy docs are the last 2 lines in my code:
# some more imports
from importlib import import_module
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from crochet import setup
setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)          # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()                # get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)                         # from Scrapy docs
This code lets me choose which spider to run simply by passing its name to the run_spider function, and once scraping has finished, pick another spider and run it again.
In your case you need to create a separate function in a separate file that runs your spider, and call it from your task. That is usually how I do it :)
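The dynamic-import half of run_spider is plain importlib and works independently of Scrapy. A self-contained sketch of the same pattern, using the stdlib json module as a stand-in for the spider module:

```python
from importlib import import_module

def load_attr(module_name, attr_name):
    # Same pattern as run_spider: import a module by its dotted name
    # at runtime, then pull the wanted object out of it.
    module = import_module(module_name)
    return getattr(module, attr_name)

# Stand-ins: "json" plays the role of "first_scrapy.spiders.<spiderName>",
# and "loads" plays the role of the spider class.
loads = load_attr("json", "loads")
print(loads('{"spider": "mySpider"}'))  # {'spider': 'mySpider'}
```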
P.S. And there really is no way to restart the Twisted reactor.
Update 1
I don't know whether you even need to call the start_requests() method. For me it usually works with just this code:
class mySpider(scrapy.Spider):
    name = "somname"
    allowed_domains = ["somesite.com"]
    start_urls = ["https://somesite.com"]

    def parse(self, response):
        pass

    def parse_dir_contents(self, response):  # for crawling additional links
        pass
Regarding python - Django Celery Scrapy error: twisted.internet.error.ReactorNotRestartable, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50140887/