python - Downloading Scrapy files to a custom path based on the item

I think what I'm trying to do is pretty basic, but I can't find a way to do it.

I'm trying to use FilesPipeline in Scrapy to download files (e.g. Image1.jpg) and save them in a path relative to the item that issued the request in the first place (e.g. item.name).

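This question is very similar to the one here, except that I want to pass the item.name or item.something field as an argument, so that each file is saved in a custom path based on item.name.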

The path is defined in the persist_file function, but that function has no access to the item itself, only to the file request and response.

def get_media_requests(self, item, info):
    return [Request(x) for x in item.get(self.FILES_URLS_FIELD, [])]
I can also see above that the request is created here so the files can be processed by the pipeline, but is there a way to pass an extra argument so that it can be used later in file_downloaded and afterwards in persist_file?

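As a last resort, it would be easy enough to rename/move the file after it has been downloaded, in one of the subsequent pipelines, but that seems sloppy, doesn't it?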

I'm using the code implemented here as a custom pipeline.

Can anyone help? Thanks in advance :)

Best answer

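Create your own pipeline (inheriting from FilesPipeline) and override the pipeline's process_item method so that the current item is passed on to the other functions: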

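from twisted.internet.defer import DeferredList
from scrapy.utils.misc import arg_to_iter

def process_item(self, item, spider):
    info = self.spiderinfo
    requests = arg_to_iter(self.get_media_requests(item, info))
    # Pass the item along so _process_request can use it for the file path
    dlist = [self._process_request(r, info, item) for r in requests]
    dfd = DeferredList(dlist, consumeErrors=True)
    return dfd.addCallback(self.item_completed, item, info)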

Then you need to override _process_request as well and keep passing the item argument along, so that it is available when the file path is built.
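As an alternative to overriding the private methods above, the item data can travel with each request. The following is only a minimal sketch, not the answer's own method: it assumes the item exposes a "name" field and the default "file_urls" field, and the names ItemPathFilesPipeline and item_file_path are hypothetical. It stores the item name in Request.meta and reads it back in file_path, which FilesPipeline calls to decide where a file is stored. Recent Scrapy releases also pass the item to file_path as a keyword argument, so the signature accepts it. The import is guarded only so the path helper can be tried without Scrapy installed.

```python
import posixpath
import re
from urllib.parse import urlparse

try:
    from scrapy import Request
    from scrapy.pipelines.files import FilesPipeline
except ImportError:  # Scrapy missing: keep the path helper usable on its own
    Request, FilesPipeline = None, object


def item_file_path(item_name, url):
    """Build '<sanitised item name>/<file name taken from the URL>'."""
    # Replace anything that is not a word character, dot or dash, and strip
    # leading/trailing dots and underscores so '..' cannot escape the store.
    folder = re.sub(r"[^\w.-]+", "_", str(item_name or "")).strip("._") or "unnamed"
    filename = posixpath.basename(urlparse(url).path) or "file"
    return f"{folder}/{filename}"


class ItemPathFilesPipeline(FilesPipeline):  # hypothetical name
    def get_media_requests(self, item, info):
        # Assumption: the item has 'file_urls' (Scrapy's default) and 'name'.
        for url in item.get("file_urls", []):
            yield Request(url, meta={"item_name": item.get("name")})

    def file_path(self, request, response=None, info=None, *, item=None):
        # The returned path is relative to the FILES_STORE setting.
        return item_file_path(request.meta.get("item_name"), request.url)
```

To try it, the class would be registered in ITEM_PIPELINES under whatever module path the project uses (for example a hypothetical myproject.pipelines.ItemPathFilesPipeline), with FILES_STORE pointing at the base directory.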

Regarding python - Downloading Scrapy files to a custom path based on the item, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33549187/
