python - How do I programmatically get the CSV link behind a JavaScript page?

Tags: python web-scraping beautifulsoup

I'm using Python, and I'm trying to get the link the CSV comes from when I click the DATA V CSV button at the bottom of this page.

I tried BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# Find the link to the CSV file
csv_link = soup.find('a', string='DATA V CSV').get('href')

I also tried:

soup.find("button", {"id":"DATA V CSV"})

But it can't find the link behind DATA V CSV.

Best answer

To get all the data, you need to exactly mimic the requests the browser sends to the server.

Here's how:

from shutil import copyfileobj
from urllib.parse import urlencode

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "referer": "https://www.ceps.cz/en/all-data",
    "accept": "application/json, text/javascript, */*; q=0.01",
    "cookie": "nette-samesite=1; ARRAffinity=3ee2404f26d0149d946e50cb3d4c22661f9f3b6510837fa538c67990a81979de; ARRAffinitySameSite=3ee2404f26d0149d946e50cb3d4c22661f9f3b6510837fa538c67990a81979de"
}

payload = {
    "do": "loadGraphData",
    "method": "AktualniSystemovaOdchylkaCR",
    "graph_id": "1026",
    "move_graph": "day",
    "download": "csv",
    "date_to": "2023-03-28T23:59:59",
    "date_from": "2023-03-28T00:00:00",
    "agregation": "MI",
    "date_type": "day",
    "interval": "false",
    "version": "bla",
    "function": "AVG",
}

all_data = "https://www.ceps.cz/en/all-data"
download_url = "https://www.ceps.cz/download-data/?format=csv"

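with requests.Session() as s:
    # First request: send the query parameters as an AJAX (XMLHttpRequest) call.
    headers.update({"x-requested-with": "XMLHttpRequest"})
    r = s.get(f"{all_data}?{urlencode(payload)}", headers=headers)
    print(r.json()["result"])
    # Second request: fetch the CSV in the same session, without the AJAX header.
    headers.pop("x-requested-with")
    with s.get(download_url, headers=headers, stream=True) as r, \
            open("data.csv", "wb") as f:
        # Stream the raw response body straight to disk.
        copyfileobj(r.raw, f)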
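
The payload above hard-codes a single day. As a minimal sketch, the same query parameters could be built for any date like this. Note that `build_payload` is a hypothetical helper name, not part of the original answer, and it assumes the server keeps accepting the parameter names and timestamp format shown in the payload above.

```python
from datetime import date, datetime, time
from urllib.parse import urlencode


def build_payload(day: date) -> dict:
    """Return the query parameters covering one full day (hypothetical helper)."""
    start = datetime.combine(day, time.min)        # 00:00:00 on that day
    end = datetime.combine(day, time(23, 59, 59))  # 23:59:59 on that day
    return {
        "do": "loadGraphData",
        "method": "AktualniSystemovaOdchylkaCR",
        "graph_id": "1026",
        "move_graph": "day",
        "download": "csv",
        "date_to": end.isoformat(),
        "date_from": start.isoformat(),
        "agregation": "MI",  # spelled as in the original payload
        "date_type": "day",
        "interval": "false",
        "version": "bla",
        "function": "AVG",
    }


# Build the query string for the same day used in the answer.
query = urlencode(build_payload(date(2023, 3, 28)))
print(query)
```

The resulting `query` string can be appended to the `all_data` URL in place of `urlencode(payload)`.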

You should get a semicolon-delimited file, saved as data.csv.
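
Because the file is semicolon-delimited, the delimiter has to be passed explicitly when reading it. Here is a minimal sketch using the standard `csv` module. The sample text and column names are made-up placeholders, not the real contents of the export, which may also contain leading metadata lines that would need to be skipped.

```python
import csv
import io

# Placeholder text standing in for data.csv; the real columns differ.
sample = "timestamp;value\n2023-03-28 00:00;1.5\n2023-03-28 00:01;-2.0\n"


def read_semicolon_csv(handle):
    """Parse semicolon-delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=";"))


rows = read_semicolon_csv(io.StringIO(sample))
print(rows[0])
```

For the downloaded file, the same function could be called on a handle opened with `open("data.csv", newline="")`.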

Regarding "python - How do I programmatically get the CSV link behind a JavaScript page?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/75863105/
