Python script to extract data from an HTML page

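I am trying to accumulate a large amount of data on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a ton of team stats.

I have tried to write a script that scans this page for all the desired links (all of the team stats), finds the rank of a specified team (given as input), and then returns the sum of that team's ranks across all the links.

I was glad to find this: https://gist.github.com/phillipsm/404780e419c49a5b62a8

...which is great!

But I must be doing something wrong, because I get 0.

Here is my code: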

import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []

for table_row in soup.select(".expand-section li"):

    table_cells = table_row.findAll('li')

    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0

for link in stat_links:
    r = requests.get(link)
    soup = BeaultifulSoup(r.text)

    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank

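Check the link to verify that I am specifying the correct classes. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first one, but I don't know.

I don't use Python, so I'm not familiar with any of its debugging tools. If anyone wants to point me to one of them, that would be great!

Best answer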

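First of all, the Team Stats and Player Stats sections are each contained in a div with class="column large-2". Team Stats is the first occurrence. You can then find all of the tags with an href attribute inside it. I have combined both steps in one line.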

teamstats = soup(class_='column large-2')[0].find_all(href=True)
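The line above leans on two BeautifulSoup behaviors that are easy to miss, and the second one probably explains why the selector in the question matches nothing. The following is a minimal sketch run against hypothetical markup (the `html` string is invented for illustration and the real page may differ): calling the soup object is shorthand for `find_all`, a multi-word `class_` string is compared against the whole class attribute, and in a CSS selector a space means "descendant", so several classes on one element have to be chained with dots.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual HTML may differ.
html = """
<div class="column large-2">
  <a href="#">Team Stats</a>
  <a href="/ncaa-basketball/stat/points-per-game">Points per Game</a>
</div>
<div class="column large-2">
  <a href="/ncaa-basketball/player-stat/points">Points</a>
</div>
<table class="tr-table datatable scrollable"><tr><td>1</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...).
blocks = soup(class_="column large-2")

# A multi-word class_ string must equal the class attribute as written,
# so the same classes in a different order do not match.
reordered = soup.find_all(class_="large-2 column")

# A CSS selector with chained classes does not depend on their order.
chained = soup.select("div.large-2.column")

# Spaces mean "descendant": this looks for <datatable> and <scrollable>
# elements nested inside .tr-table, which do not exist.
wrong = soup.select(".tr-table datatable scrollable tr")

# Chaining the classes with dots targets the one table that carries all of them.
right = soup.select("table.tr-table.datatable.scrollable tr")

print(len(blocks), len(reordered), len(chained), len(wrong), len(right))
```

If the class order on the live page ever changes, the `select` form with chained classes is the more forgiving of the two.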

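The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs are just '#' (part of the navigation links), so I excluded them.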

links = [a['href'] for a in teamstats if a['href'] != '#']

Here is a sample of the output:

links
Out[84]: 
['/ncaa-basketball/stat/points-per-game',
 '/ncaa-basketball/stat/average-scoring-margin',
 '/ncaa-basketball/stat/offensive-efficiency',
 '/ncaa-basketball/stat/floor-percentage',
 '/ncaa-basketball/stat/1st-half-points-per-game',
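The links in this output are relative, so they cannot be passed to `requests.get` as they are. The sketch below shows one possible way to finish the job, under stated assumptions: `absolute_links`, `team_rank`, and `total_rank` are hypothetical helper names, and the table layout (rank in the first cell, team name in the second) is taken from the code in the question rather than verified against the live page. It also converts the rank to an integer before adding, since cell text is a string.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.teamrankings.com/ncb/stats/"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve relative hrefs such as '/ncaa-basketball/stat/...' against the page URL."""
    return [urljoin(base, href) for href in hrefs]


def team_rank(html, team):
    """Return the integer rank of `team` on one stat page, or None if it is absent.

    Assumes each data row holds the rank in its first <td> and the team
    name in its second <td>, as the code in the question does.
    """
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        # Header rows use <th>, so they have no <td> cells and are skipped.
        if len(cells) >= 2 and cells[1].get_text(strip=True) == team:
            return int(cells[0].get_text(strip=True))
    return None


def total_rank(pages, team):
    """Sum the team's rank over several stat pages, ignoring pages without the team."""
    ranks = (team_rank(page, team) for page in pages)
    return sum(rank for rank in ranks if rank is not None)
```

To use this against the site, each URL from `absolute_links(links)` would be fetched with `requests.get(url).text` and the resulting strings passed to `total_rank`. Keeping the parsing separate from the downloading makes it easy to print intermediate values, or to step through one page with `python -m pdb`, when a result looks wrong.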

This question about extracting data from an HTML page with a Python script comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/34647146/
