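I am using BeautifulSoup to scrape the first wikitable on the page List of military engagements during the Russian invasion of Ukraine, to get the names of all 57 battles. I have attached an image of the table's HTML for reference: HTML of the wikitable.
To get all the <a> elements in the first column and extract only their text (the battle names), I did the following: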
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
battlenames = []
for row in rows:
    # Find the first <td> element within the row
    td_element = row.find('td')
    if td_element:
        # Find the first <a> element within the <td> element
        battlename = td_element.find('a')
        cleanname = battlename.text
        battlenames.append(cleanname)
for name in battlenames:
    print(name)
I ran this in both Spyder and Jupyter Notebook and got the following error:
AttributeError Traceback (most recent call last)
Cell In[6], line 18
15 if td_element:
16 # Find the first <a> element within the <td> element
17 battlename = td_element.find('a')
---> 18 cleanname = battlename.text
19 battlenames.append(cleanname)
21 for name in battlenames:
AttributeError: 'NoneType' object has no attribute 'text'
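This surprised me, because the first <td> element of every row (<tr>) contains an <a> element with the battle name. In other words, there are no empty cells in the first column of the table that would cause a NoneType error. What could be the problem?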
Best answer
Edit
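Following @Ouroboros1's comment, to be more precise: the problem is that some of the td elements do not contain an a.
The table contains one "sub" tr for "Battles of Voznesensk", where the first td holds "9 March 2022" from the "Start date" column. This td happens to have no link (a), so td_element.find('a') returns None for that row. You therefore also have to check that an a was found before calling .text: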
if td_element:
    # Find the first <a> element within the <td> element
    battlename = td_element.find('a')
    # Check here that an <a> was actually found
    if battlename:
        cleanname = battlename.text
        battlenames.append(cleanname)
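The behaviour described above can be reproduced offline on a small inline table. This is only a minimal sketch: SAMPLE_HTML is a hypothetical, heavily simplified stand-in for the real wikitable ("Battle A" and the placeholder dates are invented for illustration), first_column_names is a helper name introduced here, and html.parser is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the wikitable. The first cell of
# "Battles of Voznesensk" spans two rows, so the second of those rows
# starts with a date cell that contains no link.
SAMPLE_HTML = """
<table>
  <tr><th>Battle</th><th>Start date</th></tr>
  <tr><td><a href="#a">Battle A</a></td><td>placeholder date</td></tr>
  <tr><td rowspan="2"><a href="#v">Battles of Voznesensk</a></td><td>placeholder date</td></tr>
  <tr><td>9 March 2022</td></tr>
</table>
"""


def first_column_names(html):
    """Collect the link text from the first <td> of each row, skipping rows without one."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find("table").find_all("tr"):
        td = row.find("td")
        if td is None:
            # Header row: it only has <th> cells
            continue
        link = td.find("a")
        if link is None:
            # "Sub" row: its first <td> is the date cell, which has no <a>
            continue
        names.append(link.get_text(strip=True))
    return names


names = first_column_names(SAMPLE_HTML)
print(names)
```

Removing the second `continue` guard makes the sketch fail on the last row with the same AttributeError as in the question, which suggests the rowspan layout, not an empty cell, is what triggers it.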
You could also change your selection strategy and use css selectors to select only those tr elements whose first td contains an a:
soup.table.select('tr:has(td:first-of-type a)')
Or even directly select all the a elements in the first td of each tr:
soup.table.select('tr td:first-of-type a')
Example with CSS selectors
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# Option A: select the rows whose first <td> contains an <a>
for row in soup.table.select('tr:has(td:first-of-type a)'):
    print(row.td.a.text)

# Option B: select the <a> elements in the first <td> of each row directly
for a in soup.table.select('tr td:first-of-type a'):
    print(a.text)
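Both selectors can be checked without a network request. The sketch below runs them against a hypothetical, simplified table (SAMPLE_HTML, "Battle A" and the placeholder dates are invented for illustration; html.parser replaces lxml only to keep the snippet dependency-free) and compares the results.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the wikitable, including one
# rowspan "sub" row whose first <td> is a date cell without a link.
SAMPLE_HTML = """
<table>
  <tr><th>Battle</th><th>Start date</th></tr>
  <tr><td><a href="#a">Battle A</a></td><td>placeholder date</td></tr>
  <tr><td rowspan="2"><a href="#v">Battles of Voznesensk</a></td><td>placeholder date</td></tr>
  <tr><td>9 March 2022</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option A: rows whose first <td> contains an <a>, then read that link
option_a = [
    row.td.a.get_text(strip=True)
    for row in soup.table.select("tr:has(td:first-of-type a)")
]

# Option B: the <a> elements inside the first <td> of each row
option_b = [
    a.get_text(strip=True)
    for a in soup.table.select("tr td:first-of-type a")
]

print(option_a)
print(option_b)
```

On this sample the two options give the same list. They can differ on real pages: Option B returns every link found in a matching cell, and if the first td of a "sub" row happened to contain a link (for example a linked date), both selectors would pick it up, so the output is worth a quick manual check.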
Source: python - NoneType error when trying to access the .text attribute of an existing <a> element, Stack Overflow: https://stackoverflow.com/questions/77364381/