python - How can I save the data from a scrapy crawler into a variable?

Tags: python scrapy

I am currently building a web application that displays data collected by a Scrapy spider. The user makes a request, the spider crawls a website, and the data is returned to the application so it can be presented. I would like to retrieve the data directly from the scraper, without relying on an intermediate .csv or .json file. Something like this:

from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'http://www.example.com'
crawler = CrawlerProcess()
crawler.crawl(MySpider, start_urls=[url])  # pass the spider class, not an instance
crawler.start()
data = crawler.data # this bit

Best Answer

This is not straightforward, because Scrapy is non-blocking and works in an event loop: it uses the Twisted event loop, and the Twisted event loop is not restartable. So you cannot write crawler.start(); data = crawler.data - after crawler.start() the process runs forever, calling registered callbacks, until it is killed or ends.
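If the application can simply block until the crawl completes, there is a more direct workaround: connect a listener to the item_scraped signal and append every item to a plain list. Below is a minimal sketch of that pattern (not part of the original answer; MySpider and the start URL are placeholders taken from the question):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from scraper.spiders import MySpider  # assumed spider class from the question

items = []

def collect_item(item, response, spider):
    # Called once for every scraped item via the item_scraped signal.
    items.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler, start_urls=['http://www.example.com'])
process.start()  # blocks until the crawl is finished

print(items)  # the scraped data, now in an ordinary Python variable

Note that this only works once per process: the Twisted reactor cannot be restarted, so a long-lived web application needs the event-loop-based approach below.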


If you use an event loop in your application (for example, if you have a Twisted or Tornado web server), then it is possible to get the data from a crawl without storing it to disk. The idea is to listen for the item_scraped signal. I use the following helper to make this nicer:

import collections

from twisted.internet.defer import Deferred
from scrapy import signals

def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (an ItemCursor instance)
    which allows you to retrieve scraped items and to wait for
    items to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)    
    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self._items:
            # an item is already queued; drain the queue even if the
            # crawl has finished in the meantime
            d = Deferred()
            d.callback(True)
            return d

        if self.closed:
            # crawl is finished and the queue is empty
            d = Deferred()
            d.callback(False)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()
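
For completeness, here is a minimal, self-contained driver for the helper above (a sketch, not part of the original answer; MySpider is the assumed spider class):

from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from scraper.spiders import MySpider  # assumed spider class from the question

@inlineCallbacks
def main(reactor):
    configure_logging()
    runner = CrawlerRunner()
    async_items = scrape_items(runner, MySpider)
    while (yield async_items.fetch_next):
        item = async_items.next_item()
        print(item)  # e.g. hand the item over to your web application

react(main)  # runs the reactor until main's Deferred fires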

The API is inspired by motor, a MongoDB driver for asynchronous frameworks. Using scrape_items, you can get items from Twisted or Tornado callbacks as soon as they are scraped, in a way that is similar to fetching items from a MongoDB query.

A similar question about python - How can I save the data from a scrapy crawler into a variable? can be found on Stack Overflow: https://stackoverflow.com/questions/40715369/
