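python - Unable to download all images using Beautiful Soup

Tags: python beautifulsoup

I'm fairly new to data scraping, and I can't get Beautiful Soup to download images.

I need to download all the images from a website. I'm using the code below: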

import re
import requests
from bs4 import BeautifulSoup

site = 'http://someurl.org/'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')

# img_tags = soup.findAll('img')
img_tags = soup.findAll('img',{"src":True})

print('img_tags: ')
print(img_tags)

urls = [img['src'] for img in img_tags]

print('urls: ')
print(urls)

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the base url, which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

However, this ignores every image on the page whose HTML looks like this:

<img data-bind="attr: { src: thumbURL() }" src="/assets/images/submissions/abfc-2345345234.thumb.png">

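I think this is because the data attribute also contains the string "src", but I can't seem to figure it out.

Best answer

You need to use Selenium, or something else that can run JavaScript. The code below loads the page and waits until the images are found.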

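from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

site = 'http://phylopic.org/'
dr = webdriver.Chrome()

dr.get(site)
try:
    # wait up to 20 s, polling every 0.5 s, until an element
    # with class "span1" is visible
    element = WebDriverWait(dr, 20, 0.5).until(
        EC.visibility_of_element_located((By.CLASS_NAME, "span1"))
    )
except TimeoutException:
    print("Wait a bit more")
    time.sleep(5)

# page_source holds the HTML after JavaScript has run
text = dr.page_source
soup = BeautifulSoup(text, "lxml")  # the "lxml" parser needs the lxml package
imgs = soup.find_all('img')
print(imgs)

# quit() ends the whole browser session, not just the current window
dr.quit()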
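
As a quick check on the question's guess that the data-bind attribute gets in the way, here is a minimal sketch that parses the exact tag from the question with the standard library's html.parser (the parser behind Beautiful Soup's 'html.parser' option). ImgSrcCollector is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            attr_map = dict(attrs)
            if attr_map.get("src"):
                self.sources.append(attr_map["src"])


# The tag from the question, unchanged
html = ('<img data-bind="attr: { src: thumbURL() }" '
        'src="/assets/images/submissions/abfc-2345345234.thumb.png">')

collector = ImgSrcCollector()
collector.feed(html)
print(collector.sources)
```

The src is picked up even though data-bind also mentions "src", so the attribute itself is probably not the cause. That fits the explanation above: these tags are likely added by JavaScript after the page loads, so they never appear in the HTML that requests receives.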

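The second problem is how to convert relative paths to absolute paths. There are several kinds of relative path in HTML.

When the page URL is http://someurl.org/somefd/somefd2/ (the trailing slash makes somefd2 the current directory):

  • <img src="picture.jpg"> resolves to http://someurl.org/somefd/somefd2/picture.jpg
  • <img src="images/picture.jpg"> resolves to http://someurl.org/somefd/somefd2/images/picture.jpg
  • <img src="/images/picture.jpg"> resolves to http://someurl.org/images/picture.jpg
  • <img src="../picture.jpg"> resolves to http://someurl.org/somefd/picture.jpg

Here is my code for converting a relative path (rp) to an absolute path (ap).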

import re

site = 'https://en.wikipedia.org/wiki/IMAGE'


def r2a(path, site=site):
    """Convert a relative path to an absolute URL.

    `site` is treated as the current directory, even without a trailing slash.
    """
    # one list entry per "../" step in the path
    rp = re.findall(r"\.\./", path)

    if path.startswith(("http://", "https://")):
        # already a full http(s) url
        return path

    elif path.startswith("//"):
        # protocol-relative url: only the "http:" scheme is missing
        return "http:" + path

    elif path.startswith("/"):
        # located relative to the root of the current site
        site_root = re.findall(r"https?://[^/]+", site)
        return site_root[0] + path

    elif rp:
        # located one or more levels up from the current folder
        # number of folders of `site` to keep; clamp at the site root
        sitep = max(len(re.findall(r"[^/]+", site)) - 2 - len(rp), 0)
        # group 1 is the site root, group 2 the first `sitep` folders
        new_path = re.findall(r"(https?://[^/]+)((?:/[^/]+){%d})" % sitep, site)
        return "{}/{}".format("".join(new_path[0]), path.replace("".join(rp), ""))

    else:
        # located in the same folder as the current page
        return "{}/{}".format(site, path)


assert "https://en.wikipedia.org/wiki/IMAGE/a.jpg" == r2a("a.jpg")
assert "https://en.wikipedia.org/wiki/IMAGE/unknow/a.jpg" == r2a("unknow/a.jpg")
assert "https://en.wikipedia.org/unknow/a.jpg" == r2a("/unknow/a.jpg")
assert "https://en.wikipedia.org/wiki/a.jpg" == r2a("../a.jpg")
assert "https://en.wikipedia.org/a.jpg" == r2a("../../a.jpg")
assert "https://en.wikipedia.org/wiki/IMAGE/a.jpg" == r2a("https://en.wikipedia.org/wiki/IMAGE/a.jpg")
assert "http://en.wikipedia.org/" == r2a("//en.wikipedia.org/")
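
For comparison, the standard library can do the same conversion. The sketch below uses urllib.parse.urljoin to resolve the relative forms listed above, plus a small helper that derives a local filename from the resolved URL. The names absolute_url and filename_from_url are hypothetical helpers, and page_url is an assumed example value with a trailing slash.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

# Assumed example page URL; the trailing slash makes "somefd2" the current directory.
page_url = "http://someurl.org/somefd/somefd2/"


def absolute_url(src, base=page_url):
    """Resolve an <img> src against the page URL using standard URL rules."""
    return urljoin(base, src)


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


for src in ["picture.jpg", "images/picture.jpg", "/images/picture.jpg",
            "../picture.jpg", "//someurl.org/picture.jpg"]:
    print(src, "->", absolute_url(src))

thumb = absolute_url("/assets/images/submissions/abfc-2345345234.thumb.png")
print(filename_from_url(thumb))
```

Note one difference from r2a: urljoin treats a base without a trailing slash as a file, so joining "https://en.wikipedia.org/wiki/IMAGE" with "a.jpg" gives "https://en.wikipedia.org/wiki/a.jpg". Taking the filename from the URL path also sidesteps the regex in the question, which does not match a name containing two dots such as abfc-2345345234.thumb.png.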

Regarding "python - Unable to download all images using Beautiful Soup", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53051407/
