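I'm using BeautifulSoup in Python to scrape football statistics from this site: https://www.skysports.com/premier-league-results/2020-21. However, the page only shows the first 200 fixtures of the season; the remaining 180 are behind a "Show More" button. The button doesn't change the URL, so I can't just substitute a different URL.
Here is my code: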
from bs4 import BeautifulSoup
import requests
scores_html_text = requests.get('https://www.skysports.com/premier-league-results/2020-21').text
scores_soup = BeautifulSoup(scores_html_text, 'lxml')
fixtures = scores_soup.find_all('div', class_ = 'fixres__item')
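This only gets the first 200 fixtures.
How can I access the HTML behind the "Show More" button?
Best answer
The hidden results are inside a <script> tag, so to get all 380 results you need to parse that tag's contents separately: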
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.skysports.com/premier-league-results/2020-21"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The remaining fixtures are stored as raw HTML text inside this <script> tag.
script = soup.select_one('[type="text/show-more"]')
# Parse that text and swap it in, so the hidden items become part of the tree.
script.replace_with(BeautifulSoup(script.contents[0], "html.parser"))

all_data = []
for item in soup.select(".fixres__item"):
    # First five text fragments: team 1, score 1, score 2, time, team 2.
    all_data.append(item.get_text(strip=True, separator="|").split("|")[:5])
    # The match date is in the nearest preceding date header.
    all_data[-1].append(
        item.find_previous(class_="fixres__header2").get_text(strip=True)
    )

df = pd.DataFrame(
    all_data, columns=["Team 1", "Score 1", "Score 2", "Time", "Team 2", "Date"]
)
print(df)
df.to_csv("data.csv", index=False)
Prints:
Team 1 Score 1 Score 2 Time Team 2 Date
0 Arsenal 2 0 16:00 Brighton and Hove Albion Sunday 23rd May
1 Aston Villa 2 1 16:00 Chelsea Sunday 23rd May
2 Fulham 0 2 16:00 Newcastle United Sunday 23rd May
3 Leeds United 3 1 16:00 West Bromwich Albion Sunday 23rd May
...
377 Crystal Palace 1 0 15:00 Southampton Saturday 12th September
378 Liverpool 4 3 17:30 Leeds United Saturday 12th September
379 West Ham United 0 2 20:00 Newcastle United Saturday 12th September
and saves the same table to data.csv.
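To see why swapping the <script> tag for its parsed contents works, here is a minimal offline sketch that needs no network access. The SAMPLE_HTML string and the expand_show_more helper are hypothetical, made up for illustration; only the class names and the text/show-more type come from the answer above. It assumes the hidden markup sits in the tag as a single text node, which is how html.parser treats script contents.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one visible fixture, plus two more
# stored as raw HTML text inside a <script type="text/show-more"> tag.
SAMPLE_HTML = """
<div class="fixres__body">
  <h4 class="fixres__header2">Sunday 23rd May</h4>
  <div class="fixres__item">Arsenal 2 0 Brighton and Hove Albion</div>
  <script type="text/show-more">
    <h4 class="fixres__header2">Saturday 12th September</h4>
    <div class="fixres__item">Liverpool 4 3 Leeds United</div>
    <div class="fixres__item">Crystal Palace 1 0 Southampton</div>
  </script>
</div>
"""


def expand_show_more(html):
    """Parse html and splice the hidden markup into the tree.

    Returns the soup and the number of items visible before the splice.
    """
    soup = BeautifulSoup(html, "html.parser")
    visible = len(soup.select(".fixres__item"))
    script = soup.select_one('script[type="text/show-more"]')
    # Guard against pages where the tag is missing or empty.
    if script is not None and script.string:
        # The script body is plain text to the parser, so parse it
        # separately and replace the tag with the resulting elements.
        script.replace_with(BeautifulSoup(script.string, "html.parser"))
    return soup, visible


soup, visible_before = expand_show_more(SAMPLE_HTML)
items = soup.select(".fixres__item")

# Pair each fixture with the nearest preceding date header.
rows = [
    (
        item.get_text(strip=True),
        item.find_previous(class_="fixres__header2").get_text(strip=True),
    )
    for item in items
]

print(f"visible before: {visible_before}, after: {len(items)}")
for fixture, date in rows:
    print(date, "|", fixture)
```

The None check is an addition for robustness: if the site stops shipping the hidden block, the function falls back to the visible items instead of raising an AttributeError.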
Source: the Stack Overflow question "python - How to scrape a "show more" button with BeautifulSoup in Python?": https://stackoverflow.com/questions/69118605/