I am defining an item exporter that pushes items to a message queue. Here is the code.
from scrapy.contrib.exporter import JsonLinesItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy import log
from scrapy.conf import settings
from carrot.connection import BrokerConnection, Exchange
from carrot.messaging import Publisher

log.start()


class QueueItemExporter(JsonLinesItemExporter):

    def __init__(self, **kwargs):
        log.msg("Initialising queue exporter", level=log.DEBUG)
        self._configure(kwargs)
        host_name = settings.get('BROKER_HOST', 'localhost')
        port = settings.get('BROKER_PORT', 5672)
        userid = settings.get('BROKER_USERID', "guest")
        password = settings.get('BROKER_PASSWORD', "guest")
        virtual_host = settings.get('BROKER_VIRTUAL_HOST', "/")
        self.encoder = settings.get('MESSAGE_Q_SERIALIZER', ScrapyJSONEncoder)(**kwargs)
        log.msg("Connecting to broker", level=log.DEBUG)
        self.q_connection = BrokerConnection(hostname=host_name, port=port,
                                             userid=userid, password=password,
                                             virtual_host=virtual_host)
        self.exchange = Exchange("scrapers", type="topic")
        log.msg("Connected", level=log.DEBUG)

    def start_exporting(self):
        spider_name = "test"
        log.msg("Initialising publisher", level=log.DEBUG)
        self.publisher = Publisher(connection=self.q_connection,
                                   exchange=self.exchange,
                                   routing_key="scrapy.spider.%s" % spider_name)
        log.msg("done", level=log.DEBUG)

    def finish_exporting(self):
        self.publisher.close()

    def export_item(self, item):
        log.msg("In export item", level=log.DEBUG)
        itemdict = dict(self._get_serialized_fields(item))
        self.publisher.send({"scraped_data": self.encoder.encode(itemdict)})
        log.msg("sent to queue - scrapy.spider.naukri", level=log.DEBUG)
I'm running into a problem: the items are not being submitted to the queue. I have added the following to my settings:
FEED_EXPORTERS = {
    "queue": 'scrapers.exporters.QueueItemExporter'
}
FEED_FORMAT = "queue"
LOG_STDOUT = True
The code raises no errors, and I can't see any of the log messages either. I'm at a loss as to how to debug this.
Any help would be much appreciated.
Best answer
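"Feed Exporters" are a quick (but somewhat dirty) shortcut for calling some "standard" item exporters. Instead of configuring a feed exporter from settings, hard-wire your custom item exporter into a custom pipeline, as explained here http://doc.scrapy.org/en/0.14/topics/exporters.html#using-item-exporters :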
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# Module path taken from the FEED_EXPORTERS setting in the question.
from scrapers.exporters import QueueItemExporter


class MyPipeline(object):

    def __init__(self):
        ...
        # Create and tear down the exporter together with the spider.
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        ...

    def spider_opened(self, spider):
        self.exporter = QueueItemExporter()
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()

    def process_item(self, item, spider):
        # YOUR STUFF HERE
        ...
        self.exporter.export_item(item)
        return item
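
To check the call order without a broker or a running crawl, the same lifecycle can be exercised in isolation. The following is only a minimal sketch with no Scrapy or carrot dependency: `InMemoryPublisher`, `SketchQueueExporter`, `SketchPipeline` and `run_demo` are hypothetical names made up for illustration, and plain `json.dumps` stands in for `ScrapyJSONEncoder`.

```python
import json


class InMemoryPublisher(object):
    """Hypothetical stand-in for carrot's Publisher; it only records messages."""

    def __init__(self):
        self.sent = []
        self.closed = False

    def send(self, message):
        self.sent.append(message)

    def close(self):
        self.closed = True


class SketchQueueExporter(object):
    """Mirrors the start/export/finish steps of the exporter in the question."""

    def __init__(self, publisher_factory=InMemoryPublisher):
        self.publisher_factory = publisher_factory
        self.publisher = None

    def start_exporting(self):
        # The publisher is created here, not in __init__, as in the question.
        self.publisher = self.publisher_factory()

    def export_item(self, item):
        # Assumption: json.dumps replaces ScrapyJSONEncoder in this sketch.
        payload = json.dumps(dict(item), sort_keys=True)
        self.publisher.send({"scraped_data": payload})

    def finish_exporting(self):
        self.publisher.close()


class SketchPipeline(object):
    """Drives the exporter the way the pipeline above does, minus the signals."""

    def spider_opened(self, spider):
        self.exporter = SketchQueueExporter()
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


def run_demo():
    """Simulate one spider run by calling the hooks by hand."""
    pipeline = SketchPipeline()
    pipeline.spider_opened(spider="test")
    pipeline.process_item({"title": "example"}, spider="test")
    publisher = pipeline.exporter.publisher
    pipeline.spider_closed(spider="test")
    return publisher.sent, publisher.closed
```

In a real project the pipeline also has to be listed in the ITEM_PIPELINES setting, otherwise Scrapy never calls process_item and nothing reaches the exporter.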
Regarding python - Scrapy custom exporter, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8911162/