python - How to rename crawled files with the search keyword in python icrawler

Tags: python, web-scraping, web-crawler, google-image-search

I am using icrawler to download some images from Google, but the downloaded files are named 000001.jpg, 000002.jpg, and so on. What I want instead is for each image, as it is downloaded, to be renamed to its search keyword (e.g. Coláiste Íosagáin.jpg, Mount Anville Secondary School.jpg, St Laurence College.jpg).

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    parser_threads=2,
    downloader_threads=4,
    storage={'root_dir': 'images'})

for keyword in ['Coláiste Íosagáin', 'Mount Anville Secondary School', 'St Laurence College']:
    google_crawler.crawl(
        keyword=keyword, max_num=1, min_size=(800, 800), max_size=(1200, 1200)) 

Best Answer

You can achieve this by overriding a couple of classes so that the keyword argument is passed through to get_filename. It may not be the most direct approach, but it works.

First, you need to override GoogleImageCrawler's crawl method so that the keyword arg is passed on to the downloader. Adding one line takes a fair amount of copy/paste:

from icrawler.builtin import GoogleImageCrawler
from icrawler import ImageDownloader
from PIL import Image
from six import BytesIO

class KeywordGoogleImageCrawler(GoogleImageCrawler):

    def crawl(self,
              keyword,
              filters=None,
              offset=0,
              max_num=1000,
              min_size=None,
              max_size=None,
              language=None,
              file_idx_offset=0,
              overwrite=False):
        if offset + max_num > 1000:
            if offset > 1000:
                self.logger.error(
                    '"Offset" cannot exceed 1000, otherwise you will get '
                    'duplicated searching results.')
                return
            elif max_num > 1000:
                max_num = 1000 - offset
                self.logger.warning(
                    'Due to Google\'s limitation, you can only get the first '
                    '1000 result. "max_num" has been automatically set to %d. '
                    'If you really want to get more than 1000 results, you '
                    'can specify different date ranges.', 1000 - offset)

        feeder_kwargs = dict(
            keyword=keyword,
            offset=offset,
            max_num=max_num,
            language=language,
            filters=filters)
        downloader_kwargs = dict(
            keyword=keyword,  # <<< add this line
            max_num=max_num,
            min_size=min_size,
            max_size=max_size,
            file_idx_offset=file_idx_offset,
            overwrite=overwrite)
        # super(GoogleImageCrawler, ...) resolves to the base Crawler.crawl,
        # exactly as in the GoogleImageCrawler.crawl this method was copied from.
        super(GoogleImageCrawler, self).crawl(
            feeder_kwargs=feeder_kwargs, downloader_kwargs=downloader_kwargs)

Then, you need to override the Downloader class:

  1. Update get_filename to accept the keyword as an argument
  2. Update get_filename to include the keyword in the filename
  3. Update keep_file to take a **kwargs parameter so you don't get an "unexpected keyword argument" error
  4. Update the download method to pass the keyword through to the get_filename call

class KeywordNameDownloader(ImageDownloader):

    def get_filename(self, task, default_ext, keyword):
        # Prepend the search keyword to the default numeric filename.
        filename = super(KeywordNameDownloader, self).get_filename(
            task, default_ext)
        return keyword + filename

    def keep_file(self, task, response, min_size=None, max_size=None, **kwargs):
        """Decide whether to keep the image

        Compare image size with ``min_size`` and ``max_size`` to decide.

        Args:
            response (Response): response of requests.
            min_size (tuple or None): minimum size of required images.
            max_size (tuple or None): maximum size of required images.
        Returns:
            bool: whether to keep the image.
        """
        try:
            img = Image.open(BytesIO(response.content))
        except (IOError, OSError):
            return False
        task['img_size'] = img.size
        if min_size and not self._size_gt(img.size, min_size):
            return False
        if max_size and not self._size_lt(img.size, max_size):
            return False
        return True

    def download(self,
                 task,
                 default_ext,
                 timeout=5,
                 max_retry=3,
                 overwrite=False,
                 **kwargs):
        """Download the image and save it to the corresponding path.

        Args:
            task (dict): The task dict got from ``task_queue``.
            timeout (int): Timeout of making requests for downloading images.
            max_retry (int): the max retry times if the request fails.
            **kwargs: reserved arguments for overriding.
        """
        file_url = task['file_url']
        task['success'] = False
        task['filename'] = None
        retry = max_retry
        # Pull the keyword that the crawler passed through.
        keyword = kwargs['keyword']

        if not overwrite:
            with self.lock:
                self.fetched_num += 1
                filename = self.get_filename(task, default_ext, keyword)
                if self.storage.exists(filename):
                    self.logger.info('skip downloading file %s', filename)
                    return
                self.fetched_num -= 1

        while retry > 0 and not self.signal.get('reach_max_num'):
            try:
                response = self.session.get(file_url, timeout=timeout)
            except Exception as e:
                self.logger.error('Exception caught when downloading file %s, '
                                  'error: %s, remaining retry times: %d',
                                  file_url, e, retry - 1)
            else:
                if self.reach_max_num():
                    self.signal.set(reach_max_num=True)
                    break
                elif response.status_code != 200:
                    self.logger.error('Response status code %d, file %s',
                                      response.status_code, file_url)
                    break
                elif not self.keep_file(task, response, **kwargs):
                    break
                with self.lock:
                    self.fetched_num += 1
                    filename = self.get_filename(task, default_ext, keyword)
                self.logger.info('image #%s\t%s', self.fetched_num, file_url)
                self.storage.write(filename, response.content)
                task['success'] = True
                task['filename'] = filename
                break
            finally:
                retry -= 1
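One optional tweak, not part of the original answer: keywords like 'Mount Anville Secondary School' end up in the filename with their spaces intact. If you prefer filesystem-friendlier names, get_filename is the natural place to normalize them. A minimal drop-in replacement for the get_filename above, assuming a hyphen-for-whitespace convention of my own choosing:

    def get_filename(self, task, default_ext, keyword):
        filename = super(KeywordNameDownloader, self).get_filename(
            task, default_ext)
        # Assumed convention: 'St Laurence College' -> 'St-Laurence-College'.
        safe_keyword = '-'.join(keyword.split())
        return safe_keyword + filename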

After all that, you can run something like this:

google_crawler = KeywordGoogleImageCrawler(downloader_cls=KeywordNameDownloader, storage={'root_dir': 'image_downloads'})
google_crawler.crawl(keyword='cat', max_num=4)

and get output like the following:

 Directory of C:\<my local path>\image_downloads

12/09/2019  09:57 AM    <DIR>          .
12/09/2019  09:57 AM    <DIR>          ..
12/09/2019  09:57 AM         1,246,129 cat000001.png
12/09/2019  09:57 AM         2,627,334 cat000002.jpg
12/09/2019  09:57 AM           127,213 cat000003.jpg
12/09/2019  09:57 AM           789,779 cat000004.jpg
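Applied to the keywords from the question, the two classes slot straight into the original loop. A sketch combining the question's crawler settings with the classes above (with the unmodified get_filename, the first file would be saved as, e.g., Coláiste Íosagáin000001.jpg):

google_crawler = KeywordGoogleImageCrawler(
    downloader_cls=KeywordNameDownloader,
    parser_threads=2,
    downloader_threads=4,
    storage={'root_dir': 'images'})

for keyword in ['Coláiste Íosagáin', 'Mount Anville Secondary School',
                'St Laurence College']:
    google_crawler.crawl(
        keyword=keyword, max_num=1,
        min_size=(800, 800), max_size=(1200, 1200))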

Regarding python - how to rename crawled files with the search keyword in python icrawler, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51495087/
