python - Unable to download images with Python Scrapy using relative URLs

Tags: python web-crawler scrapy

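I am using Scrapy to download images from http://www.vesselfinder.com/vessels

However, I can only get the relative URL of each image, like this: http://www.vesselfinder.com/vessels/ship-photo/0-227349190-7c01e2b3a7a5078ea94fff9a0f862f8a/0

Every image is named 0.jpg, but if I try to use that absolute URL, I cannot access the image.

My code:

items.py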

import scrapy

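class VesselItem(scrapy.Item):
    name = scrapy.Field()
    nationality = scrapy.Field()
    type = scrapy.Field()         # set by the spider
    image_urls = scrapy.Field()   # read by the images pipeline
    images = scrapy.Field()       # filled in by the images pipeline
    image_paths = scrapy.Field()  # set by VesselPipeline.item_completed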

pipelines.py

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem

class VesselPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

vessel_spider.py

import scrapy

from vessel.items import VesselItem

class VesselSpider(scrapy.Spider):
    """docstring for VesselSpider"""
    name = "vessel"
    allowed_domains = ["vesselfinder.com"]
    page_name = "http://vesselfinder.com"
    start_urls = [
        # "http://vesselfinder.com/vessels?page=%d" %i for i in range(0,1000)
        "http://vesselfinder.com/vessels"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="items"]/article'):
            item = VesselItem()

            # str() of an extracted list looks like "[u'...']" on Python 2;
            # the slices below cut the wanted text out of it
            imageStr = str(sel.xpath('div[1]/a/picture/img/@src').extract())
            item['image_urls'] = self.page_name + imageStr[3:-2]
            nameStr = str(sel.xpath('div[2]/header/h1/a/text()').extract())
            item['name'] = nameStr[19:-8]
            typeStr = str(sel.xpath('div[2]/div[2]/div[2]/text()').extract())
            item['type'] = typeStr[3:-2]

            # yield rather than return, so every article on the page produces an item
            yield item

When I run this spider, I get an exceptions.ValueError: Missing scheme in request url: h error, because I am not supplying an absolute URL.

[vessel] ERROR: Error processing {'image_urls': 'http://vesselfinder.com/vessels/ship-photo/0-224138470-a2fdc783d05a019d00ad9db0cef322f7/0.jpg',
     'name': 'XILGARO ALEANTE',
     'type': 'Sailing vessel'}
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 383, in callback
        self._startRunCallbacks(result)
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 578, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/pipeline/media.py", line 40, in process_item
        requests = arg_to_iter(self.get_media_requests(item, info))
      File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/pipeline/images.py", line 104, in get_media_requests
        return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
        self._set_url(url)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    exceptions.ValueError: Missing scheme in request url: h

How can I fix this? Is there a special way to get the images (or their absolute URLs) from a site like this?

Best answer

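Wrap your image URL in a list, like this:

item['image_urls'] = [self.page_name + imageStr[3:-2]]

The images pipeline iterates over image_urls and builds one Request per element, as the traceback line [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])] shows. When image_urls is a plain string, that loop walks it character by character, so the first request is for the URL "h", which is why the error reads Missing scheme in request url: h.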
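The difference between a string and a list here can be reproduced without Scrapy. The sketch below is only an illustration: request_urls and absolute_image_urls are hypothetical helpers, not Scrapy APIs, and the relative src value is inferred from the URL shown in the error log. It assumes Python 3 (on Python 2, urljoin lives in the urlparse module) and uses urljoin instead of string concatenation to build the absolute URL.

```python
from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin


def request_urls(image_urls):
    """Hypothetical stand-in for the pipeline loop: one entry per element of image_urls."""
    return [url for url in image_urls]


def absolute_image_urls(page_url, srcs):
    """Resolve each (possibly relative) src against page_url and return them as a list."""
    return [urljoin(page_url, src) for src in srcs]


page_url = "http://vesselfinder.com/vessels"
# Relative src as inferred from the error log in the question (an assumption).
src = "/vessels/ship-photo/0-224138470-a2fdc783d05a019d00ad9db0cef322f7/0.jpg"

# A bare string is iterated character by character, so the first "URL" is just "h".
as_string = urljoin(page_url, src)
print(request_urls(as_string)[0])

# A list is iterated element by element, so the whole URL is requested.
as_list = absolute_image_urls(page_url, [src])
print(request_urls(as_list)[0])
```

The first print shows the single character that ends up in the error message, while the second shows the full absolute URL. Inside a spider on Scrapy 1.0 or later, response.urljoin(src) should give the same resolution against the response URL, though the traceback above suggests an older release.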

Source: Stack Overflow, https://stackoverflow.com/questions/30069761/
