django - Running a Scrapy spider in a Celery task (Django project)

Tags: django interface scrapy admin celery

I'm trying to run Scrapy (a spider/crawl) from a Django project, as a task managed through the Celery admin interface. Here is my code. This is the error I get when I try to call the task from the Python shell in cmd (error image).

Django project:

-monapp:
   -tasks.py
   -spider.py
   -myspider.py
   -models.py
   .....

tasks.py:

    from djcelery import celery
    from demoapp.spider import *
    from demoapp.myspider import *


    @celery.task
    def add(x, y):
        return x + y


    @celery.task
    def scra():
        result_queue = Queue()
        crawler = CrawlerWorker(MySpider(), result_queue)
        crawler.start()
        return "success"

spider.py:

    from scrapy import project, signals
    from scrapy.settings import Settings
    from scrapy.crawler import Crawler
    from scrapy.xlib.pydispatch import dispatcher
    from multiprocessing.queues import Queue
    import multiprocessing


    class CrawlerWorker(multiprocessing.Process):

        def __init__(self, spider, result_queue):
            multiprocessing.Process.__init__(self)
            self.result_queue = result_queue
            self.crawler = Crawler(Settings())
            if not hasattr(project, 'crawler'):
                self.crawler.install()
            self.crawler.configure()

            self.items = []
            self.spider = spider
            dispatcher.connect(self._item_passed, signals.item_passed)

        def _item_passed(self, item):
            self.items.append(item)

        def run(self):
            self.crawler.crawl(self.spider)
            self.crawler.start()
            self.crawler.stop()
            self.result_queue.put(self.items)
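
For context, a worker like the one above would be driven roughly like this (a sketch using the classes defined in this question):

    from multiprocessing import Queue

    result_queue = Queue()
    worker = CrawlerWorker(MySpider(), result_queue)
    worker.start()              # run() executes the crawl in a child process
    items = result_queue.get()  # blocks until run() puts the scraped items
    worker.join()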

myspider.py:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.item import Item, Field


    class TorentItem(Item):
        title = Field()
        desc = Field()


    class MySpider(CrawlSpider):
        name = 'job'
        allowed_domains = ['tanitjobs.com']
        start_urls = ['http://tanitjobs.com/browse-by-category/Nurse/']
        rules = (
            Rule(SgmlLinkExtractor(allow=('page=*',),
                                   restrict_xpaths=('//div[@class="pageNavigation"]',),
                                   unique=True),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            items = hxs.select('//div[@class="offre"]/div[@class="detail"]')
            scraped_items = []
            for item in items:
                scraped_item = TorentItem()
                scraped_item['title'] = item.select('a/strong/text()').extract()
                scraped_item['desc'] = item.select('./div[@class="descriptionjob"]/text()').extract()
                scraped_items.append(scraped_item)
            return scraped_items
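
As a side note, instead of accumulating items in a list and returning it, a Scrapy callback can also yield items one at a time; a minimal sketch of the same callback in that style:

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        for item in hxs.select('//div[@class="offre"]/div[@class="detail"]'):
            scraped_item = TorentItem()
            scraped_item['title'] = item.select('a/strong/text()').extract()
            scraped_item['desc'] = item.select('./div[@class="descriptionjob"]/text()').extract()
            yield scraped_item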

Best Answer

I got this working on the shell using a Django management command. Below is my code snippet; feel free to modify it to fit your needs.

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings

from django.core.management.base import BaseCommand

from myspiderproject.spiders.myspider import MySpider

class ReactorControl:
    def __init__(self):
        self.crawlers_running = 0

    def add_crawler(self):
        self.crawlers_running += 1

    def remove_crawler(self):
        self.crawlers_running -= 1
        if self.crawlers_running == 0:
            reactor.stop()

def setup_crawler(domain):
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.signals.connect(reactor_control.remove_crawler, signal=signals.spider_closed)

    spider = MySpider(domain=domain)
    crawler.crawl(spider)
    reactor_control.add_crawler()
    crawler.start()

reactor_control = ReactorControl()

class Command(BaseCommand):
    help = 'Crawls the site'

    def handle(self, *args, **options):
        setup_crawler('somedomain.com')
        reactor.run()  # the script blocks here until the spider_closed signal is sent
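
To tie this back to the original question, a management command like this can then be triggered from a Celery task. Since Twisted's reactor cannot be restarted within a process, one option is to launch the command in a fresh process on every invocation. A sketch, assuming the command above is saved as crawl_site; the task name run_crawl and the use of the plain Celery shared_task decorator are placeholders, not part of the original code:

from celery import shared_task
import subprocess

@shared_task
def run_crawl():
    # each crawl gets its own process, because reactor.run() can only be
    # called once per process; adjust the python/manage.py paths as needed
    subprocess.check_call(['python', 'manage.py', 'crawl_site'])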

Hope this helps.

Regarding "django - Running a Scrapy spider in a Celery task (Django project)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22578145/
