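I am trying to scrape Wikipedia's table of COVID-19 data for the United States ( https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases ), but I am having trouble determining whether an HTML element contains text. I tried using
element.text is not None
as the if condition, but it still lets through HTML elements that output nothing.
element.text != ''
gives the same result. Is there anything else I can check? Here is my full code: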
def getCases(page):
    cases = []
    firstCaseChild = page.find(title='January 21, 2020')
    firstCaseChild2 = firstCaseChild.find_parent('th')
    row = 0
    column = 0
    firstRow = []
    for case in firstCaseChild2.find_next_siblings('td'):
        if column == 55:
            break
        if case.text is not None:
            firstRow.append(case.text)
            column = column+1
            print(case.text)
        else:
            firstRow.append('0')
            column = column+1
            print('0')
Best answer
Another solution, without using pandas:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tr in soup.tbody.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    tds = [int(td) if td else 0 for td in tds]  # replace empty text '' with 0
    print(('{:>5}'*len(tds)).format(*tds))
Prints:
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 5 0 0 0 0 0 0 1 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0
0 0 5 0 0 0 0 0 0 1 0 5 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 12 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0
0 0 4 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 9 0 0 0 0 0 0 0
0 0 8 2 0 0 0 0 1 0 0 31 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 0 0 0 0 0 1 3 0 0 1 11 0 0 0 0 0 0 0
0 1 11 6 1 0 0 0 1 0 1 10 0 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 3 1 1 0 0 1 0 0 3 0 0 0 0 0 5 0 0 0 2 22 2 1 0 0 0 0 0
0 2 8 0 0 0 0 0 0 4 0 22 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 2 4 0 0 0 0 2 0 0 1 0 0 1 0 5 0 0 0 0 45 2 0 0 0 0 0 0
0 0 26 0 1 0 0 0 2 7 0 34 0 3 1 2 0 0 1 0 0 2 0 0 0 0 0 0 4 4 3 0 0 0 4 2 4 1 0 1 1 0 15 2 0 2 2 17 2 0 1 0 0 0 0
0 0 19 4 0 0 0 0 0 0 0 26 0 5 4 2 0 0 0 0 0 0 3 0 0 1 0 0 2 6 2 1 0 5 1 1 1 3 0 1 3 0 13 0 0 0 5 36 4 0 0 0 0 0 0
0 1 24 5 0 0 0 0 1 1 1 105 0 5 8 4 0 2 1 0 0 2 1 1 5 1 0 0 9 5 2 2 0 0 2 3 4 4 0 0 0 0 51 1 0 1 4 31 2 2 0 0 0 0 0
0 3 20 17 0 0 1 4 0 4 2 99 1 1 6 2 0 0 2 0 1 0 0 0 3 3 0 1 3 9 0 10 1 1 1 2 4 0 0 1 5 1 3 3 0 0 8 43 4 0 0 0 0 0 0
1 0 21 15 0 0 0 2 0 5 1 91 0 2 7 0 4 10 4 1 0 5 1 1 0 2 0 5 18 11 3 6 0 7 2 9 2 8 0 3 0 3 13 3 1 1 6 112 6 0 1 0 0 0 0
0 0 49 28 0 1 3 4 6 6 4 111 1 1 14 3 1 13 5 2 0 4 8 1 1 11 6 6 33 0 3 17 5 0 1 8 16 13 0 5 0 0 15 5 2 1 21 93 19 15 0 0 0 3 1
...and so on.
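To illustrate why the checks in the question do not filter anything: in Beautiful Soup, `.text` always returns a string, so `element.text is not None` is true for every cell. A cell that looks empty may also contain only whitespace such as a newline (an assumption about this particular table), in which case `element.text != ''` is true as well. The sketch below works on plain strings standing in for `td.text` values, so it runs without fetching the page; `cell_to_int` and the sample values are hypothetical names and data, not part of the original code.

```python
def cell_to_int(text):
    """Convert a table cell's text to an int, treating blank cells as 0."""
    cleaned = text.strip()  # drop surrounding whitespace such as a trailing newline
    return int(cleaned) if cleaned else 0


# Hypothetical cell texts: a number, an empty cell, and a whitespace-only cell
samples = ["1", "", "\n"]

for text in samples:
    # The first two checks are the ones tried in the question;
    # the last column is the value after stripping.
    print(repr(text), text is not None, text != "", cell_to_int(text))

counts = [cell_to_int(text) for text in samples]
print(counts)
```

With Beautiful Soup this would be called as `cell_to_int(td.text)`. The `td.get_text(strip=True)` call in the answer performs the same stripping before the emptiness test.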
This question, "python - Determining whether HTML contains text with BS4 in Python", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/62959171/