python - Using Beautiful Soup to get an image's data-src when there is no image extension

I am trying to get the image URLs for all the books on this page, https://www.nb.co.za/en/books/0-6-years, with BeautifulSoup.

This is my code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
        tag.parent.name == 'div' and
        'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

        cover = soup.find_all('div', class_="img-container")
        print(cover)

This is my output:

<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>

What I expect to get:

https://www.nb.co.za/en/helper/ReadImage/25929.jpg

My questions are:

How do I get only the data-src?
How do I get the extension of the image?

Best answer

1: How do I get the data-source only?

You can access the data-src attribute with element['data-src']:

cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
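As a minimal, self-contained sketch of that attribute access, the snippet below parses the HTML fragment from the question's output instead of fetching the live page. The variable names are illustrative, and html.parser is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the output shown in the question.
html = """
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# Subscript access raises KeyError if the attribute is missing;
# .get() returns None instead, which is safer on untidy pages.
data_src = img["data-src"]
placeholder_src = img.get("src")
missing = img.get("data-original")  # attribute not present in this fragment

print(data_src)
```

The src attribute only holds the loading placeholder, which is why the real cover path has to be read from data-src.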

2: How do I get the extension of the image?

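You can work out the file extension as diggusbickus mentioned (a good approach), but it will not help you here: requesting a URL such as https://www.nb.co.za/en/helper/ReadImage/25929.jpg results in a 404 error.

The images are served dynamically, with the additional information supplied alongside them rather than in the URL -> https://stackoverflow.com/a/5110673/14460824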
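If a file extension is still needed, for example to name a downloaded file, one option is to derive it from the Content-Type header of the image response instead of from the URL. The sketch below is only an illustration under that assumption: the helper name extension_from_content_type and the small override table are hypothetical, and what this particular site actually sends in that header has not been checked.

```python
import mimetypes

# Hypothetical override table so common image types map to the usual
# extension regardless of platform-specific mimetypes tables.
PREFERRED_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_from_content_type(content_type):
    """Return a file extension for a Content-Type header value, or ''."""
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return PREFERRED_EXTENSIONS.get(mime) or mimetypes.guess_extension(mime) or ""


# Possible use with requests (not executed here, needs network access):
#   r = requests.get(cover_url)
#   ext = extension_from_content_type(r.headers.get("Content-Type", ""))
print(extension_from_content_type("image/jpeg"))
```

The extension is then a local naming choice for the saved file; the URL used for the request stays without a suffix.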

Example

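from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'
data = []

# Fetch and parse the listing page, as in the question
r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')

for item in soup.select('.book-slider-frame'):
    data.append({
        'link': baseurl + item.a['href'],
        # Use the lazy-loaded data-src unless src already points at the no-cover image
        'cover': baseurl + item.img['data-src'] if item.img['src'] != nocover else baseurl + nocover
    })

data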

Output

[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
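The doubled slash after the host in the output above comes from concatenating a base URL that ends in / with paths that start with /. A small sketch of one way to avoid it, using urljoin from the standard library; the helper name absolute_url is made up for illustration and is not part of the original answer.

```python
from urllib.parse import urljoin

baseurl = "https://www.nb.co.za/"


def absolute_url(path):
    """Resolve a site-relative path against baseurl without doubling slashes."""
    return urljoin(baseurl, path)


cover = absolute_url("/en/helper/ReadImage/25929")
link = absolute_url("/en/view-book/?id=9780798182539")
print(cover)
print(link)
```

In the example loop, baseurl+item.a['href'] and baseurl+item.img['data-src'] could be swapped for calls to such a helper.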

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/70134620/
