python - Database storage: Why is Pipeline better than Feed Export?

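This is a question about Scrapy.

When storing items in a database, why is it conventional to do so through an item pipeline rather than the Feed Export mechanism?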

Feed Exports - Output your scraped data using different formats and storages

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly

Item Pipeline - Post-process and store your scraped data

Typical use for item pipelines are... storing the scraped item in a database

What are the differences between the two, what are their pros and cons, and why is a pipeline the better fit?

Thanks

Best answer

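This answer comes late, but I just spent a whole afternoon and evening trying to understand the difference between item pipelines and feed exports, the latter being poorly documented. I think it will help anyone who is still confused.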

TL;DR: Feed exports are designed for exporting items to files. They are not suited to database storage at all.

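Feed export is implemented as a Scrapy extension in scrapy.extensions.feedexport. Like other Scrapy extensions, it works by registering callback methods (open_spider, close_spider and item_scraped) for the corresponding Scrapy signals (spider_opened, spider_closed and item_scraped), so that it can take the necessary steps to store items.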

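On open_spider, FeedExporter (the actual extension class) initializes the feed storage and the item exporter. Concretely, it obtains a file-like object, usually a temporary file, from a FeedStorage and passes it to an ItemExporter. On item_scraped, FeedExporter simply calls export_item on the pre-initialized ItemExporter object. On close_spider, FeedExporter calls the store method on the FeedStorage object, which writes the file to the filesystem, uploads it to a remote FTP server, uploads it to S3, and so on.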
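The lifecycle above can be modeled with a small standard-library sketch. This is not Scrapy's actual code: MiniFeedExporter, JsonLinesExporter, LocalFileStorage and run_demo are hypothetical names made up for illustration, and the real classes take more arguments and do more work. The only point is to show that items accumulate in a temporary file and reach their destination only when the spider closes.

```python
import json
import os
import shutil
import tempfile


class JsonLinesExporter:
    """Hypothetical stand-in for an ItemExporter: serializes items into a file object."""

    def __init__(self, file):
        self.file = file

    def export_item(self, item):
        self.file.write(json.dumps(item, sort_keys=True) + "\n")


class LocalFileStorage:
    """Hypothetical stand-in for a FeedStorage: hands out a temp file, persists it later."""

    def __init__(self, path):
        self.path = path

    def open(self):
        # Items are buffered in a temporary file for the whole crawl.
        return tempfile.NamedTemporaryFile("w+", delete=False, suffix=".jl")

    def store(self, file):
        # Only at this point does the data reach its final destination.
        file.close()
        shutil.move(file.name, self.path)


class MiniFeedExporter:
    """Simplified model of the three callbacks described above (not Scrapy's class)."""

    def __init__(self, storage):
        self.storage = storage

    def open_spider(self):
        self.file = self.storage.open()
        self.exporter = JsonLinesExporter(self.file)

    def item_scraped(self, item):
        self.exporter.export_item(item)

    def close_spider(self):
        self.storage.store(self.file)


def run_demo(target_path):
    """Drive one simulated crawl and report whether the target existed mid-crawl."""
    feed = MiniFeedExporter(LocalFileStorage(target_path))
    feed.open_spider()
    feed.item_scraped({"title": "first"})
    exists_mid_crawl = os.path.exists(target_path)
    feed.item_scraped({"title": "second"})
    feed.close_spider()
    with open(target_path) as f:
        return exists_mid_crawl, f.read().splitlines()
```

In this model nothing is visible at the target path until close_spider runs, which is fine for a file but awkward for a database, where rows are normally written as they arrive.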

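There is a set of built-in item exporters and storages. But as you may have noticed from the description above, FeedExporter is by design tightly coupled to file storage. With a database, the usual way to store items is to insert each one as soon as it is scraped (possibly with some buffering).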

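So the proper way to get feed-export-style database storage seems to be writing your own FeedExporter, which you can do by registering callbacks for Scrapy signals. But that is not necessary: an item pipeline is more straightforward and does not require knowledge of such implementation details.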
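For comparison, here is a minimal sketch of the pipeline approach. It assumes SQLite through the standard-library sqlite3 module and items that carry title and url fields; the class name SQLitePipeline, the table layout and the default items.db path are all illustrative choices, not anything from the original answer. The open_spider, process_item and close_spider methods are the hooks Scrapy calls on an item pipeline.

```python
import sqlite3


class SQLitePipeline:
    """Illustrative item pipeline: inserts each item as soon as it is scraped."""

    def __init__(self, db_path="items.db"):
        # db_path is an assumption for this sketch; ":memory:" also works for experiments.
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, ensure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: write it immediately.
        self.conn.execute(
            "INSERT INTO items (title, url) VALUES (?, ?)",
            (item["title"], item["url"]),
        )
        self.conn.commit()
        # Return the item so later pipelines (if any) still receive it.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

Such a class would be enabled through the ITEM_PIPELINES setting, for example by mapping a dotted path like myproject.pipelines.SQLitePipeline (a hypothetical project name) to an order value such as 300. Committing after every insert keeps the sketch simple; batching commits is a reasonable variation when throughput matters.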

The original question, "python - Database storage: Why is Pipeline better than Feed Export?", can be found on Stack Overflow: https://stackoverflow.com/questions/10205294/
