javascript - How to scrape data from a JavaScript website?

Tags: javascript python selenium selenium-webdriver xpath

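I am trying to scrape data from this dynamic JavaScript website. Since the page is dynamic, I am using Selenium to extract the data from the table. Please suggest how I can scrape the data from the dynamic table. Here is my code.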

import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import lxml.html as LH
import requests

# specify the url
urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
print(urlpage)

# run the Chrome webdriver from the executable path of your choice
driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
##driver = webdriver.Firefox(executable_path = 'C:/Users/Shresth Suman/Downloads/geckodriver-v0.26.0-win64/geckodriver.exe')

# get web page
driver.get(urlpage)
# execute script to scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
# sleep for 5s
time.sleep(5)
# driver.quit()


# find elements by xpath
##results = driver.find_elements_by_xpath("//div[@id='div_taboa']//table[@id='taboa']/tbody")
##results = driver.find_elements_by_xpath("//*[@id='page-title']")
##results = driver.find_elements_by_xpath("//*[@id='div_main']/h2[1]")
results = driver.find_elements_by_xpath("//*[@id = 'frame_historicos']")
print(results)
print(len(results))


# create empty array to store data
data = []
# loop over results
for result in results:
    heading = result.text
    print(heading)
    headingfind = result.find_element_by_tag_name('h1')
    # append dict to array
    data.append({"head" : headingfind, "name" : heading})
# close driver 
driver.quit()
###################################################################



# save to pandas dataframe
df = pd.DataFrame(data)
print(df)
# write to csv
df.to_csv('testsot.csv')

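I want to extract the data from 2005 to the present with Averages/Totals set to 10 minutes, but that setting only gives me one month of data at a time.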

Best answer

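  1. Induce WebDriverWait with element_to_be_clickable().
  2. Install the Beautiful Soup library.
  3. Use pandas read_html().
  4. I haven't created the list. You should create a list of start dates and end dates for every month since 1/1/2005 and iterate over it.
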
    from io import StringIO
    import time

    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    urlpage = 'http://www.sotaventogalicia.com/en/real-time-data/historical'
    # Selenium 3 style call; Selenium 4 expects webdriver.Chrome(service=Service(path)) instead
    driver = webdriver.Chrome('C:/Users/Shresth Suman/Downloads/chromedriver_win32/chromedriver.exe')
    driver.get(urlpage)

    # The date form and the table live inside an iframe, so switch into it first
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "frame_historicos")))

    # Fill in the start date (first date input) and end date (last date input)
    inputstartdate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[1]")))
    inputstartdate.clear()
    inputstartdate.send_keys("1/1/2005")
    inputenddate = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//input[@class='dijitReset dijitInputInner'])[last()]")))
    inputenddate.clear()
    inputenddate.send_keys("1/31/2005")

    # Submit the form and wait until the result table is visible
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@class='form-submit'][@value='REFRESH']"))).click()
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#taboa")))
    time.sleep(3)  # give the table a moment to finish populating

    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", id="taboa")
    # read_html() returns a list of DataFrames, so take the first one
    df = pd.read_html(StringIO(str(table)))[0]
    df.to_csv('testsot.csv')
    print(df)
    driver.quit()
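
Step 4 of the answer leaves the list of monthly start and end dates unwritten. One possible way to build it with only the standard library is sketched below. The names `month_ranges`, `scrape_all`, and `fetch_month` are hypothetical and do not appear in the original answer, and the M/D/YYYY format without zero padding is an assumption based on the `1/1/2005` and `1/31/2005` values typed into the form above.

```python
import calendar
from datetime import date


def month_ranges(start, end):
    """Return one (start_date, end_date) string pair per calendar month.

    Dates are formatted as M/D/YYYY (assumed from the values used in the
    answer). The first and last months are clipped to `start` and `end`.
    """
    if start > end:
        return []
    ranges = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = max(date(year, month, 1), start)
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        last = min(date(year, month, last_day), end)
        ranges.append((f"{first.month}/{first.day}/{first.year}",
                       f"{last.month}/{last.day}/{last.year}"))
        # Move on to the next calendar month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return ranges


def scrape_all(ranges, fetch_month):
    """Call fetch_month(start, end) for every pair and collect the results.

    `fetch_month` is a hypothetical callable that would wrap the Selenium
    steps from the answer (fill both inputs, click REFRESH, read the table).
    """
    return [fetch_month(start, end) for start, end in ranges]
```

To cover the whole period, `month_ranges(date(2005, 1, 1), date.today())` would supply the pairs. If `fetch_month` returns one DataFrame per month, the resulting list could then be combined with `pd.concat` before writing a single CSV.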
    

Regarding "javascript - How to scrape data from a JavaScript website?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58819691/
