python - Web scraping attributes that are not always present in a tag (Python, BeautifulSoup)

Tags: python, beautifulsoup

I am trying to scrape the URL 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm' using BeautifulSoup. I want to scrape the player name, the injury, and the week of the injury.

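The player name is straightforward to scrape because it is the text of a <th> tag and is always present. The week is the ["data-stat"] attribute of a <td> tag and is also always present. The injury is the ["data-tip"] attribute of the same <td> tag as the week, but it is only present when the player was injured.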

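I tried using an if/else statement for the injury status: if the <td> tag contains an injury, print the injury from ["data-tip"]; otherwise just print "NA". The code I wrote prints the name, injury, and week for the first two players, but the third player's <td> tag has no ["data-tip"] attribute, so the code breaks and only the first two players are printed: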

[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']

That is the output of my code. It then hits a KeyError.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

containers = page_soup.find("tbody")

player = containers.find_all("tr")
for tr in player:
    th = tr.find_all("th")
    name = [i.text for i in th]

    week = tr.td["data-stat"]

    injury = tr.td["data-tip"]
    if injury is None:
        injury = "NA"
        print([name, injury, week])
    else:
        print([name, injury, week])

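The result I am looking for is code that prints the player name, injury (or "NA" if there is no injury), and week for every player in the table. For example, the third player in the table was not injured in week 1, so his injury should print as "NA":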

[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']
[['Kyle Arrington'], 'NA', 'week_1']

The list should continue like this for the remaining players.

Best answer

I fully support Jack Moody's solution (this just adds the remaining weeks), but here it is with the additional data/columns:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

containers = page_soup.find("tbody")
head = page_soup.find("thead")


player = containers.find_all("tr")

weeks = head.find_all('th')
week_list = [i['data-stat'] for i in weeks][1:]

for week in week_list:
    for tr in player:
        # the player's name is the text of the row's <th> cell
        th = tr.find_all("th")
        name = [i.text for i in th]

        # the <td> cell that belongs to this week
        td = tr.find('td', {'data-stat': week})

        # data-tip only exists when the player was listed as injured
        try:
            injury = td["data-tip"]
        except KeyError:
            injury = "NA"
        print([name, injury, week])
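
As a possible alternative to the try/except above, a BeautifulSoup tag's .get() method returns a default value when an attribute is missing, whereas subscripting (td["data-tip"]) raises KeyError. The sketch below is only an illustration under stated assumptions: SAMPLE_HTML is made-up markup that imitates the table structure described in the question (it is not copied from the site), and injury_rows is a hypothetical helper name. It also walks the table player by player rather than week by week, so the output order differs from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question;
# the player names and cell contents are invented for illustration.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th data-stat="player">Player A</th>
        <td data-stat="week_1" data-tip="Questionable: hamstring">Q</td>
        <td data-stat="week_2"></td></tr>
    <tr><th data-stat="player">Player B</th>
        <td data-stat="week_1"></td>
        <td data-stat="week_2" data-tip="Out: infection">O</td></tr>
  </tbody>
</table>
"""


def injury_rows(html):
    """Return a [name, injury, week] list for every player/week cell."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in page.find("tbody").find_all("tr"):
        th = tr.find("th")
        if th is None:
            # skip rows that have no player cell
            continue
        name = th.get_text(strip=True)
        for td in tr.find_all("td"):
            # .get() falls back to "NA" when data-tip is absent,
            # so no KeyError is raised for uninjured players
            rows.append([name, td.get("data-tip", "NA"), td.get("data-stat")])
    return rows


for row in injury_rows(SAMPLE_HTML):
    print(row)
```

Running the same helper on the real page would mean passing it the page_html read earlier; whether every row there has the assumed shape has not been checked here.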

Source: this question on Stack Overflow: https://stackoverflow.com/questions/54440369/
