I am trying to get the image URLs for all the books on this page, https://www.nb.co.za/en/books/0-6-years, with BeautifulSoup.
Here is my code:
from bs4 import BeautifulSoup
import requests
baseurl = "https://www.nb.co.za/"
productlinks = []
r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")
def my_filter(tag):
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent['class'])
for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])
cover = soup.find_all('div', class_="img-container")
print(cover)
Here is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I expect to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My questions are:
How do I get the data-src only?
How do I get the extension of the image?
Best answer
1: How do I get the data-source only?
You can access data-src by indexing the element, element['data-src']:
cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
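As a minimal, self-contained sketch of that attribute access: the HTML below is copied from the output in the question, and the built-in 'html.parser' is used instead of 'lxml' only so the snippet runs without extra dependencies (that swap is an assumption, not part of the original answer).

```python
from bs4 import BeautifulSoup

# HTML fragment taken from the question's printed output
html = """
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
img = soup.find("div", class_="img-container").find("img")

# img["data-src"] raises KeyError if the attribute is missing;
# img.get("data-src") returns None instead, which is safer in a loop.
data_src = img.get("data-src")
print(data_src)  # /en/helper/ReadImage/25929
```

Note that src only holds the loading placeholder here; the real image path is in data-src because the page lazy-loads its images.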
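2: How do I get the extension of the image?
You can get a file's extension as diggusbickus mentioned (a good approach), but that will not help you if you then try to request a file like https://www.nb.co.za/en/helper/ReadImage/25929.jpg. That request will result in a 404 error.
The images are loaded dynamically and served with additional information; see https://stackoverflow.com/a/5110673/14460824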
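If an extension is still needed, for example to name a downloaded file, one common option is to derive it from the response's Content-Type header rather than from the URL. The sketch below assumes the server sends such a header, which has not been verified for this site; extension_from_content_type and the _KNOWN table are hypothetical names introduced only for illustration.

```python
import mimetypes

# Small explicit table for common image types (an illustrative assumption);
# anything not listed falls back to the standard mimetypes module.
_KNOWN = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}

def extension_from_content_type(content_type):
    """Hypothetical helper: map a Content-Type header value to a file extension."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=..." and normalise case
    mime = content_type.split(";", 1)[0].strip().lower()
    return _KNOWN.get(mime) or mimetypes.guess_extension(mime)

print(extension_from_content_type("image/jpeg"))                 # .jpg
print(extension_from_content_type("image/png; charset=binary"))  # .png
```

With requests, the header value would be available as r.headers.get('Content-Type') on the response for the image URL; the extension is then only used for the local file name, not appended to the URL that is requested.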
Example
baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'
data = []
for item in soup.select('.book-slider-frame'):
    data.append({
        'link' : baseurl+item.a['href'],
        'cover' : baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
    })
data
Output
[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
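The URLs in this output contain a double slash after the host, because baseurl ends with "/" and the scraped paths start with "/". A possible way to avoid that, shown here only as a sketch and not as part of the original answer, is to join the parts with urljoin from the standard library:

```python
from urllib.parse import urljoin

baseurl = "https://www.nb.co.za/"

# urljoin resolves the root-relative path against the base,
# so no duplicate slash is produced.
cover = urljoin(baseurl, "/en/helper/ReadImage/25929")
link = urljoin(baseurl, "/en/view-book/?id=9780798182539")

print(cover)  # https://www.nb.co.za/en/helper/ReadImage/25929
print(link)   # https://www.nb.co.za/en/view-book/?id=9780798182539
```

In the loop above, this would mean replacing baseurl+item.a['href'] with urljoin(baseurl, item.a['href']), and likewise for the cover path.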
Source: a similar question on Stack Overflow about python, getting the image data-src with Beautiful Soup when there is no image extension: https://stackoverflow.com/questions/70134620/