python - How to make a selenium page wait for content

Tags: python selenium selenium-webdriver

The website returns the same result for every URL I scrape. I suspect the reason is that Selenium is not letting the site load completely before producing the result. I first wrote the code with Beautiful Soup, but according to the SO community, Selenium has to be used to obtain the final rendered page for scraping. So I used Selenium to fetch the data and Beautiful Soup to parse it, but the same problem remains. The code is below:

from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import datetime
import os
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
date_list = pd.date_range(start = "1971-02-01", end=datetime.date.today(), freq='1d')


chrome_options = Options()  
chrome_options.add_argument("--headless") # Opens the browser up in background
driver = webdriver.Chrome()

def get_batsmen(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting?at={date}'
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        html = browser.page_source
        browser.implicitly_wait(10)
        
    
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        player_list.append(player_name.text)
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

def get_bowler(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/bowling?at={date}'
    # page = requests.get(url).text
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        html = browser.page_source
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        player_list.append(player_name.text)
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

def get_allrounder(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/all-rounder?at={date}'
    # page = requests.get(url).text
    with Chrome(options=chrome_options) as browser:
        browser.get(url)
        html = browser.page_source
    doc = BeautifulSoup(html, "html.parser")
    find_class = doc.find_all("td", class_ = 'table-body__cell rankings-table__name name')
    player_list = []
    find_top = doc.find('div', class_='rankings-block__banner--name-large')
    player_list.append(find_top.text)
    for item in find_class:
        player_name = item.find("a")
        # print(player_name.text)
        player_list.append(player_name.text)
    df = pd.DataFrame(player_list, columns = ['Player Name'])
    return df

# Storing the data into multiple CSVs

for date in date_list:
    year = date.year
    month = date.month
    day = date.day
    newpath = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}'
    if not os.path.exists(newpath):
        os.makedirs(newpath)
    newpath1 = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}\{month}'
    if not os.path.exists(newpath1):
        os.makedirs(newpath1)
    newpath2 = rf'C:\Users\divya\OneDrive\Desktop\8th Sem\ISB assignment\{year}\{month}\{day}'
    if not os.path.exists(newpath2):
        os.makedirs(newpath2)
    get_batsmen(date).to_csv(newpath2+'/batsmen.csv')
    get_bowler(date).to_csv(newpath2+'/bowler.csv')
    get_allrounder(date).to_csv(newpath2+'/allrounder.csv')

I would be eternally grateful to anyone who can help.

Best Answer

A different approach may help: use an explicit wait so the driver blocks until the content you need is actually present before you read the page, e.g.

WebDriverWait(browser, delay).until(expected_condition)

where delay is the timeout in seconds. Refer to this answer.
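Note that in the question's code, browser.implicitly_wait(10) runs only after page_source has already been read, and an implicit wait only affects element lookups in any case, so it cannot help here. Below is a minimal sketch of get_batsmen that waits for the rankings table to render before grabbing the HTML; it reuses the CSS class from the question's parser, and the 10-second timeout is an assumption you may need to adjust:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument("--headless")  # run Chrome in the background

def get_batsmen(date):
    url = f'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting?at={date}'
    with webdriver.Chrome(options=chrome_options) as browser:
        browser.get(url)
        # Block until at least one player-name cell from the rankings table is
        # present in the DOM; raises TimeoutException if it never appears.
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "td.table-body__cell.rankings-table__name.name")
            )
        )
        html = browser.page_source
    return BeautifulSoup(html, "html.parser")

The same pattern applies to get_bowler and get_allrounder; only the URL changes.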

Regarding "python - How to make a selenium page wait for content", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70971171/
