python - Scraping data from each subpage with BeautifulSoup - the URLs are long and formatted differently

Tags: python loops url beautifulsoup

I am scraping NFL passing data from 1971 to 2019. I was able to scrape the data from the first page of each year using the following code:

# This code works:
import requests
from bs4 import BeautifulSoup as bsoup

passingData = []  # create empty list to store column data
for year in range(1971, 2020):
    url = 'https://www.nfl.com/stats/player-stats/category/passing/%s/REG/all/passingyards/desc' % (year)
    response = requests.get(url)
    response = response.content
    parsed_html = bsoup(response, 'html.parser')
    data_rows = parsed_html.find_all('tr')  # every <tr> on the page
    passingData.append([[col.text.strip() for col in row.find_all('td')] for row in data_rows])

The first page for each year lists only 25 players, and roughly 70-90 players attempt a pass in a given year (so there are 3-4 "subpages" of player data per year). The problem comes when I try to scrape those subpages. I tried adding another subloop that pulls each link's href for the next page, found under the div with class 'nfl-o-table-pagination__buttons'.

Unfortunately, I wasn't able to append anything beyond the first page to the passingData list. I tried the following, but the subUrl line raises an "index out of range" error.

I'm still new to web scraping, so please tell me if my logic is off. I thought I could simply append the subpage data (since the table structure is the same), but the error seems to occur when I try to go from the first page:
https://www.nfl.com/stats/player-stats/category/passing/%s/REG/all/passingyards/desc
to the second page, whose URL is:
https://www.nfl.com/stats/player-stats/category/passing/2019/REG/all/passingYards/DESC?aftercursor=0000001900000000008500100079000840a7a000000000006e00000005000000045f74626c00000010706572736f6e5f7465616d5f737461740000000565736249640000000944415234363631343100000004726f6c6500000003504c5900000008736561736f6e496400000004323031390000000a736561736f6e5479706500000003524547f07fffffe6f07fffffe6389bd3f93412939a78c1e6950d620d060004
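
Worth noting: that long aftercursor URL never has to be constructed by hand, because the next-page link already carries it in its href, which can be resolved against the current page URL. Here is a minimal sketch of that idea, using the selector from the question plus urllib.parse.urljoin; note that select() returns an empty list when the pagination block is missing from the fetched HTML, which is exactly what makes an unconditional [0] raise IndexError (the real site may also want the User-Agent header used in the answer below):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bsoup

url = 'https://www.nfl.com/stats/player-stats/category/passing/2019/REG/all/passingyards/desc'
soup = bsoup(requests.get(url).content, 'html.parser')

# guard against an empty match before indexing [0]
links = soup.select('.nfl-o-table-pagination__buttons a')
if links:
    # urljoin resolves the (possibly relative) href, aftercursor and all
    next_url = urljoin(url, links[0]['href'])
    sub_soup = bsoup(requests.get(next_url).content, 'html.parser')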

    for subPage in range(1971,2020):
        subPassingData = []
        subUrl = soup.select('.nfl-o-table-pagination__buttons a')[0]['href']  # <-- "index out of range" error raised here
        new = requests.get(f"{url}{subUrl}")
        newResponse = new.content
        soup1 = bsoup(new.text, 'html.parser')
        sub_data_rows = soup1.find_all('tr')
        subPassingData.append([[col.text.strip() for col in row.find_all('td')] for row in data_rows])

    passingData.append(subPassingData)

Thanks for your help.

Best Answer

This script works across all the selected years and their subpages, and loads the data into a dataframe (or you can save it to CSV, etc...):

import requests
from bs4 import BeautifulSoup

url = 'https://www.nfl.com/stats/player-stats/category/passing/{year}/REG/all/passingyards/desc'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []

for year in range(2017, 2020):  # <-- change to the desired range of years
    soup = BeautifulSoup(requests.get(url.format(year=year), headers=headers).content, 'html.parser')
    page = 1

    while True:
        print('Page {}/{}...'.format(page, year))

        for tr in soup.select('tr:has(td)'):
            tds = [year] + [td.get_text(strip=True) for td in tr.select('td')]
            all_data.append(tds)

        next_url = soup.select_one('.nfl-o-table-pagination__next')
        if not next_url:
            break

        u = 'https://www.nfl.com' + next_url['href']
        soup = BeautifulSoup(requests.get(u, headers=headers).content, 'html.parser')
        page += 1


# here we create dataframe from the list `all_data` and print it to screen:
import pandas as pd
df = pd.DataFrame(all_data)
print(df)
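
As a follow-up, saving the result is one line; column 0 holds the year prepended to each row inside the loop above, and the stat columns come through unnamed (0..16) since the script only reads the td cells (the filename here is just an example):

# column 0 is the year prepended in the loop; remaining columns are unnamed
df.to_csv('nfl_passing.csv', index=False)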

Prints:

Page 1/2017...
Page 2/2017...
Page 3/2017...
Page 4/2017...
Page 1/2018...
Page 2/2018...
Page 3/2018...
Page 4/2018...
Page 1/2019...
Page 2/2019...
Page 3/2019...
Page 4/2019...
        0                   1     2    3    4    5      6   7   8      9   10     11  12  13  14  15   16
0    2017           Tom Brady  4577  7.9  581  385  0.663  32   8  102.8  230  0.396  62  10  64  35  201
1    2017       Philip Rivers  4515  7.9  575  360  0.626  28  10     96  216  0.376  61  12  75  18  120
2    2017    Matthew Stafford  4446  7.9  565  371  0.657  29  10   99.3  209   0.37  61  16  71  47  287
3    2017          Drew Brees  4334  8.1  536  386   0.72  23   8  103.9  201  0.375  72  11  54  20  145
4    2017  Ben Roethlisberger  4251  7.6  561  360  0.642  28  14   93.4  207  0.369  52  14  97  21  139
..    ...                 ...   ...  ...  ...  ...    ...  ..  ..    ...  ...    ...  ..  ..  ..  ..  ...
256  2019      Trevor Siemian     3  0.5    6    3    0.5   0   0   56.3    0      0   0   0   3   2   17
257  2019       Blake Bortles     3  1.5    2    1    0.5   0   0   56.3    0      0   0   0   3   0    0
258  2019       Kenjon Barner     3    3    1    1      1   0   0   79.2    0      0   0   0   3   0    0
259  2019         Alex Tanney     1    1    1    1      1   0   0   79.2    0      0   0   0   1   0    0
260  2019          Matt Haack     1    1    1    1      1   1   0  118.8    1      1   0   0   1   0    0

[261 rows x 17 columns]

Regarding "python - Scraping data from each subpage with BeautifulSoup - the URLs are long and formatted differently", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62645312/
