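I am trying to use BeautifulSoup to parse the image source from each of 82 URLs stored in a list named site_links. I don't understand why the loop throws an error partway through. Any ideas?
Error: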
/images/africa/egypt/abu-gorab-sun-temples/sun-temple-of-niuserre-main.jpg
/images/africa/egypt/abu-roash-pyramid-of-djedefre/abu-roash-pyramid-of-djedefre-main.jpg
/images/africa/egypt/abusir-necropolis/abusir-necropolis-main1.jpg
/images/africa/egypt/dashur-bent-pyramid/dashur-bent-pyramid-main1.jpg
/images/africa/egypt/giza-plateau-pyramid-complex/giza-plateau-pyramid-complex-main1.jpg
/images/africa/egypt/giza-plateau-sphinx/giza-plateau-sphinx-main1.jpg
/images/africa/egypt/zawyet-el-aryan-unfinished-pyramid/zawyet-el-aryan-unfinished-pyramid-main2.jpg
/images/africa/egypt/abu-simbel-temple-complex/abu-simbel-temple-complex-main1.jpg
/images/africa/egypt/aswan-elephantine-island/aswan-elephantine-island-main.jpg
/images/africa/egypt/denderra-temple-complex/denderra-temple-complex-main2.jpg
/images/africa/egypt/thebes-karnak-temple-complex/thebes-karnak-temple-complex-main5.jpg
/images/africa/egypt/thebes-luxor-temple/thebes-luxor-temple-main3.jpg
/images/africa/ethiopia/axum-obelisks/axum-obelisks-main1.jpg
/images/africa/ethiopia/lalibela-rock-hewn-churches/lalibela-rock-hewn-churches-main3.jpg
/images/asia/india/ellora-kailasa-temple/ellora-kailasa-temple-main1.jpg
/images/asia/india/warangal-warangal-fort/warangal-warangal-fort-main1.jpg
/images/asia/indonesia/west-java-gunung-padang/west-java-gunung-padang-main1.jpg
/images/asia/japan/yonaguni-yonaguni-monument/yonaguni-yonaguni-monument-main1.jpg
/images/asia/laos/xiangkhouang-plain-of-jars/xiangkhouang-plain-of-jars-main1.jpg
/images/asia/lebanon/baalbek-baalbek-temple-complex/baalbek-baalbek-temple-complex-main4.jpg
/images/asia/micronesia/pohnpei-nan-madol/pohnpei-nan-madol-main1.jpg
Traceback (most recent call last):
  File "c:/Users/J/Google Drive/pythonProjects/Megalith Map/data_scrape.py", line 41, in <module>
    img = soup.find('div', {'itemprop' : 'blogPost'}).find_all('img')[0].get('src')
IndexError: list index out of range
My code:
import bs4
import requests

site_links = []  # filled with the 82 URLs elsewhere in the script
site_img = []
# PARSES ALL IMAGE SOURCES ON THE WEBSITE
for i in site_links:
    r = requests.get(i).text
    soup = bs4.BeautifulSoup(r, 'html5lib')
    img = soup.find('div', {'itemprop' : 'blogPost'}).find_all('img')[0].get('src')
    if '.jpg' in img:
        site_img.append(site_img)
    print(img)
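Best answer
The result of find_all is list-like. If you try to index it while it is empty, it raises an IndexError. Here that means find_all found no img tag inside the matched div on the page where the loop stopped. To handle this case, first check whether find_all found anything, and only then index it: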
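import bs4
import requests

site_links = []  # filled with the 82 URLs elsewhere in the script
site_img = []
# PARSES ALL IMAGE SOURCES ON THE WEBSITE
for i in site_links:
    r = requests.get(i).text
    soup = bs4.BeautifulSoup(r, 'html5lib')
    images = soup.find('div', {'itemprop' : 'blogPost'}).find_all('img')
    if images:
        # Default to '' so the containment test below never sees None
        img = images[0].get('src', '')
        if '.jpg' in img:
            # Append the source string (the original appended the list to itself)
            site_img.append(img)
        print(img)
    else:
        print('No image found.')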
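Note that I also changed the get call to return an empty string when src is missing. Otherwise it would return None, and the '.jpg' in img containment test that follows would raise a TypeError.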
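The same checks can be pulled into a small helper, which also makes them easy to try on inline HTML without hitting the network. This is only a sketch: the function name first_jpg_src and the sample HTML strings are made up for illustration, and it uses the built-in html.parser instead of html5lib so that nothing beyond bs4 is required. It additionally guards against soup.find returning None when the div itself is missing, a case the loop above does not cover.

```python
from bs4 import BeautifulSoup


def first_jpg_src(html):
    """Return the src of the first <img> in the blogPost div, or None.

    Hypothetical helper: returns None when the div is missing, when it
    contains no <img>, or when the first <img> has no .jpg src.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"itemprop": "blogPost"})
    if container is None:
        # No matching div; calling .find_all on None would raise AttributeError
        return None
    images = container.find_all("img")
    if not images:
        # Empty result; images[0] would raise IndexError
        return None
    # Default to "" so the containment test never runs against None
    src = images[0].get("src", "")
    return src if ".jpg" in src else None


# Made-up sample pages covering each case
samples = {
    "with image": '<div itemprop="blogPost"><img src="/images/a-main1.jpg"></div>',
    "no image": '<div itemprop="blogPost"><p>text only</p></div>',
    "no div": "<p>nothing here</p>",
    "no src": '<div itemprop="blogPost"><img alt="x"></div>',
}

for name, html in samples.items():
    print(name, "->", first_jpg_src(html))
```

Inside the loop you could then call first_jpg_src(r) and append the result to site_img only when it is not None.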
Regarding python - Why am I getting "IndexError: list index out of range" partway through a for loop while parsing with Beautiful Soup?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56140299/