python - ESPN.com Python web scraping issue

Tags: python, selenium, web-scraping, beautifulsoup

I'm trying to pull roster data for every college football team, because I'd like to do some analysis of team performance based on roster composition.

My script works on the first page: it iterates over each team and can open each team's roster link, but the Beautiful Soup commands I then run against the roster page keep throwing index errors. When I inspect the HTML, the commands I'm writing look like they should work, but when I print the page source from Beautiful Soup I don't see what I see in the Chrome developer tools. Is this a case of JS being used to serve the content? If so, I thought Selenium took care of that?

My code...

import requests
import csv
from bs4 import BeautifulSoup
from selenium import webdriver

teams_driver = webdriver.Firefox()
teams_driver.get("http://www.espn.com/college-football/teams")
teams_html = teams_driver.page_source
teams_soup = BeautifulSoup(teams_html, "html5lib")

i = 0

for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']

        roster_driver = webdriver.Firefox()
        roster_driver.get(roster_link)
        roster_html = teams_driver.page_source
        roster_soup = BeautifulSoup(roster_html, "html5lib")

        team_name_html = roster_soup.find_all('a', class_='sub-brand-title')[0]
        team_name = team_name_html.find_all('b')[0].text

        for player_html in roster_soup.find_all('tr', class_='oddrow'):
            player_name = player_html.find_all('a')[0].text
            player_pos = player_html.find_all('td')[2].text
            player_height = player_html.find_all('td')[3].text
            player_weight = player_html.find_all('td')[4].text
            player_year = player_html.find_all('td')[5].text
            player_hometown = player_html.find_all('td')[6].text

            print(team_name)
            print('\t', player_name)

        roster_driver.close()

teams_driver.close()

Best answer

Inside the for loop you are using the html of the first page (roster_html = teams_driver.page_source), so you get an index error when you try to select the first item of team_name_html, because find_all returns an empty list.
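A quick illustration of why this surfaces as an IndexError rather than something more descriptive (here `matches` stands in for the empty list that find_all returns when run against the wrong page's html):

```python
# find_all returns a list of matches; against the wrong page the selector
# matches nothing, so indexing [0] into the empty list raises IndexError.
matches = []  # stands in for roster_soup.find_all('a', class_='sub-brand-title')
try:
    team_name_html = matches[0]
except IndexError:
    team_name_html = None

print(team_name_html)  # None
```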

Also, you don't need to keep all those Firefox instances open; you can close the driver as soon as you have the html.

teams_driver = webdriver.Firefox()
teams_driver.get("http://www.espn.com/college-football/teams")
teams_html = teams_driver.page_source
teams_driver.quit()

But you don't have to use selenium for this task at all; you can get all the data with requests and bs4.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.espn.com/college-football/teams")
teams_soup = BeautifulSoup(r.text, "html5lib")

for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']
        r = requests.get(roster_link)
        roster_soup = BeautifulSoup(r.text, "html5lib")

        team_name = roster_soup.find('a', class_='sub-brand-title').find('b').text
        for player_html in roster_soup.find_all('tr', class_='oddrow'):
            player_name = player_html.find_all('a')[0].text
            player_pos = player_html.find_all('td')[2].text
            player_height = player_html.find_all('td')[3].text
            player_weight = player_html.find_all('td')[4].text
            player_year = player_html.find_all('td')[5].text
            player_hometown = player_html.find_all('td')[6].text
            print(team_name, player_name, player_pos, player_height, player_weight, player_year, player_hometown)
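Since the question imports csv (presumably to save the data for the analysis), the collected fields can be written out as rows instead of just printed. A minimal sketch; the roster.csv filename, the header names, and the sample row are all illustrative, not part of the original code:

```python
import csv

# Rows in the same order the scraper collects the fields:
# team, name, position, height, weight, year, hometown
rows = [
    ("Example Team", "Player A", "QB", "6-2", "210", "SR", "Anytown, AL"),
]

with open("roster.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["team", "name", "pos", "height", "weight", "year", "hometown"])
    writer.writerows(rows)
```

Inside the loop you would append one tuple per player instead of hard-coding `rows`.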

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/47447024/
