python - Scrapy custom exporter

Tags: python, scrapy

I am defining an item exporter that pushes items to a message queue. The code is below.

from scrapy.contrib.exporter import JsonLinesItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy import log

from scrapy.conf import settings

from carrot.connection import BrokerConnection, Exchange
from carrot.messaging import Publisher

log.start()


class QueueItemExporter(JsonLinesItemExporter):

    def __init__(self, **kwargs):

        log.msg("Initialising queue exporter", level=log.DEBUG)

        self._configure(kwargs)

        host_name = settings.get('BROKER_HOST', 'localhost')
        port = settings.get('BROKER_PORT', 5672)
        userid = settings.get('BROKER_USERID', "guest")
        password = settings.get('BROKER_PASSWORD', "guest")
        virtual_host = settings.get('BROKER_VIRTUAL_HOST', "/")

        self.encoder = settings.get('MESSAGE_Q_SERIALIZER', ScrapyJSONEncoder)(**kwargs)

        log.msg("Connecting to broker", level=log.DEBUG)
        self.q_connection = BrokerConnection(hostname=host_name, port=port,
                        userid=userid, password=password,
                        virtual_host=virtual_host)
        self.exchange = Exchange("scrapers", type="topic")
        log.msg("Connected", level=log.DEBUG)

    def start_exporting(self):
        spider_name = "test"
        log.msg("Initialising publisher", level=log.DEBUG)
        self.publisher = Publisher(connection=self.q_connection,
                        exchange=self.exchange, routing_key="scrapy.spider.%s" % spider_name)
        log.msg("done", level=log.DEBUG)

    def finish_exporting(self):
        self.publisher.close()

    def export_item(self, item):
        log.msg("In export item", level=log.DEBUG)
        itemdict = dict(self._get_serialized_fields(item))
        self.publisher.send({"scraped_data": self.encoder.encode(itemdict)})
        log.msg("sent to queue - scrapy.spider.naukri", level=log.DEBUG)

I am running into some problems: items are not being pushed to the queue. I added the following to my settings:

FEED_EXPORTERS = {
    "queue": 'scrapers.exporters.QueueItemExporter'
}

FEED_FORMAT = "queue"

LOG_STDOUT = True

The code does not raise any errors, but I also cannot see any log messages, so I am at a loss as to how to debug this.

Any help would be greatly appreciated.

Best Answer

"Feed exporters" are quick (but somewhat dirty) shortcuts for invoking some of the "standard" item exporters. Instead of configuring a feed exporter through settings, hard-wire your custom item exporter into a custom pipeline, as described here: http://doc.scrapy.org/en/0.14/topics/exporters.html#using-item-exporters :

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# Import the custom exporter defined above (the original answer imported
# XmlItemExporter here, which is never used by this pipeline)
from scrapers.exporters import QueueItemExporter

class MyPipeline(object):

    def __init__(self):
        ...
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        ...

    def spider_opened(self, spider):
        self.exporter = QueueItemExporter()
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()

    def process_item(self, item, spider):
        # YOUR STUFF HERE
        ...
        self.exporter.export_item(item)
        return item
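For the pipeline above to take effect, it also has to be registered in the project settings. A minimal sketch, assuming the pipeline lives in a hypothetical `scrapers.pipelines` module (adjust the dotted path to wherever `MyPipeline` is actually defined):

```python
# settings.py -- register the custom pipeline.
# Scrapy 0.14 (the version the answer links to) expects a plain list of
# dotted class paths; the path below is an assumption for illustration.
ITEM_PIPELINES = ['scrapers.pipelines.MyPipeline']

# Newer Scrapy versions instead expect a dict mapping each path to an
# ordering value between 0 and 1000 (lower runs first):
# ITEM_PIPELINES = {'scrapers.pipelines.MyPipeline': 300}
```

With the pipeline registered this way, `process_item` is called for every scraped item, so the exporter's debug log messages should start appearing.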

Regarding "python - Scrapy custom exporter", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8911162/
