I am trying to scrape the URL 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm' using BeautifulSoup. I want to scrape the player name, the injury, and the week of the injury.
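The player name is straightforward to scrape because it is the text of a <th> tag and is always present. The week is an attribute, ["data-stat"], of the <td> tag and is also always present. The injury is another attribute, ["data-tip"], of the same <td> tag as the week, but it is only present when the player is injured.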
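As a rough illustration of the structure described above, here is a minimal sketch that runs on hand-written markup instead of the live page. The HTML is an assumed simplification, not copied from the site, and `describe_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup: one injured player and one healthy player.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <th data-stat="player">Danny Amendola</th>
      <td data-stat="week_1" data-tip="Questionable: hamstring"></td>
    </tr>
    <tr>
      <th data-stat="player">Kyle Arrington</th>
      <td data-stat="week_1"></td>
    </tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def describe_rows(soup):
    """Return (name, week, has_injury_attribute) for each table row."""
    rows = []
    for tr in soup.find("tbody").find_all("tr"):
        name = tr.th.get_text(strip=True)  # the name is the text of <th>
        cell = tr.td                       # first <td> in the row
        # data-stat is always present; data-tip only when the player is injured
        rows.append((name, cell["data-stat"], cell.has_attr("data-tip")))
    return rows


structure = describe_rows(sample_soup)
for row in structure:
    print(row)
```

`has_attr` only reports whether the attribute exists, so it never raises when `data-tip` is missing.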
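I tried using an if/else statement for the injury status: if the <td> tag contains the injury, print the injury ["data-tip"]; if not, just print "NA". The code I wrote prints the player name, injury, and week for the first two players, but the third player has no ["data-tip"] attribute in his <td> tag, so the code breaks and only the first two players are printed: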
[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']
That is the output of my code, after which it raises a KeyError. My code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
containers = page_soup.find("tbody")
player = containers.find_all("tr")
for tr in player:
    th = tr.find_all("th")
    name = [i.text for i in th]
    week = tr.td["data-stat"]
    injury = tr.td["data-tip"]
    if injury is None:
        injury = "NA"
        print([name, injury, week])
    else:
        print([name, injury, week])
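What I am looking for is code that prints the player name, the injury ("NA" if there is no injury), and the week for every player in the table. For example, the third player in the table was not injured in week 1, so his injury should print as "NA":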
[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']
[['Kyle Arrington'], 'NA', 'week_1']
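The list should continue like this for the rest of the players.
Best answer
I fully support Jack Moody's solution (this just adds the additional weeks), but here it is with the extra data/columns: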
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
containers = page_soup.find("tbody")
head = page_soup.find("thead")
player = containers.find_all("tr")
# week names come from the header cells; [1:] skips the first header cell
weeks = head.find_all('th')
week_list = [i['data-stat'] for i in weeks][1:]
for week in week_list:
    for tr in player:
        th = tr.find_all("th")
        name = [i.text for i in th]
        # the cell for this player in this week
        td = tr.find('td', {'data-stat': week})
        try:
            injury = td["data-tip"]
        except KeyError:
            # no data-tip attribute on this cell: report "NA"
            injury = "NA"
        print([name, injury, week])
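As an alternative to the try/except above, a tag's `get` method returns a default when an attribute is missing, which is also why the original `if injury is None` check was never reached: subscripting a tag with a missing attribute raises KeyError before the check runs. The following is a minimal sketch on hand-written markup, so it runs without a network connection. The HTML, the week_2 placeholder value, and the function name `injuries_by_week` are assumptions made for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the week_2 value is a placeholder, not real data.
SAMPLE_HTML = """
<table>
  <thead>
    <tr>
      <th data-stat="player">Player</th>
      <th data-stat="week_1">Week 1</th>
      <th data-stat="week_2">Week 2</th>
    </tr>
  </thead>
  <tbody>
    <tr><th data-stat="player">Danny Amendola</th>
        <td data-stat="week_1" data-tip="Questionable: hamstring"></td>
        <td data-stat="week_2"></td></tr>
    <tr><th data-stat="player">Armond Armstead</th>
        <td data-stat="week_1" data-tip="Out: infection"></td>
        <td data-stat="week_2" data-tip="Example: placeholder"></td></tr>
    <tr><th data-stat="player">Kyle Arrington</th>
        <td data-stat="week_1"></td>
        <td data-stat="week_2"></td></tr>
  </tbody>
</table>
"""


def injuries_by_week(html):
    """Collect [name, injury, week] entries, using "NA" when data-tip is absent."""
    page_soup = BeautifulSoup(html, "html.parser")
    header_cells = page_soup.find("thead").find_all("th")
    week_list = [th["data-stat"] for th in header_cells][1:]  # skip first header
    rows = page_soup.find("tbody").find_all("tr")
    results = []
    for week in week_list:
        for tr in rows:
            name = [th.text for th in tr.find_all("th")]
            td = tr.find("td", {"data-stat": week})
            # .get returns the default instead of raising KeyError;
            # the None check covers a row with no cell for this week
            injury = td.get("data-tip", "NA") if td is not None else "NA"
            results.append([name, injury, week])
    return results


results = injuries_by_week(SAMPLE_HTML)
for entry in results:
    print(entry)
```

Returning a list instead of printing inside the loop makes the entries easier to reuse later, for example to write them to a file.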
Source: a similar question on Stack Overflow about scraping attributes that are not always included in a tag with Python and BeautifulSoup: https://stackoverflow.com/questions/54440369/