python - Scraping PDFs from a web page

I would like to download all financial reports for a given company from the Danish company register (CVR register). An example is Chr. Hansen Holding, at the following link:

https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da

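Specifically, I would like to download all the PDFs under the "Regnskaber" (= financial reports) tab. I have no previous experience with web scraping in Python. I tried using BeautifulSoup, but given my lack of experience, I could not find the right way to search the response.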

Below is what I tried, but it prints nothing (i.e. no PDFs are found).

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"

response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')

for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])

Any help and guidance would be much appreciated.

Best answer

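You should use select instead of findAll. select takes a CSS selector, whereas findAll('accordion-toggle') looks for a tag named accordion-toggle, which does not exist on the page.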

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"

response = requests.get(web_page)
soup = BeautifulSoup(response.text, 'lxml')

# Select only the PDF links inside the "Regnskaber" accordion section
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')

# Print the file name part of each link
for link in pdfs:
    print(link['href'].split('/')[-1])
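
The snippet above only prints file names. As a rough sketch of how the same selector could feed an actual download, the example below runs it against a small hand-written HTML sample. SAMPLE_HTML, BASE_URL and download_pdf are hypothetical names introduced for illustration, and the sample markup is only an assumption about what the real page looks like. It also shows why findAll('accordion-toggle') comes back empty, and uses urljoin (already imported in the question) to turn relative links into absolute URLs.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the selector expects;
# the real page may differ.
SAMPLE_HTML = """
<div class="accordion-toggle">Regnskaber</div>
<div id="accordion-Regnskaber-og-nogletal">
  <a data-type="PDF" href="/dokumenter/report-2019.pdf">PDF</a>
  <a data-type="XBRL" href="/dokumenter/report-2019.xml">XBRL</a>
  <a data-type="PDF" href="https://example.com/files/report-2018.pdf">PDF</a>
</div>
"""
# Assumed base URL, used only to resolve relative hrefs.
BASE_URL = "https://datacvr.virk.dk/data/visenhed"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all('accordion-toggle') looks for <accordion-toggle> tags, not for
# elements with that class, so it matches nothing in this sample.
by_tag_name = soup.find_all("accordion-toggle")

# select() takes a CSS selector, so it can combine the container id
# with the data-type attribute of the links.
pdf_links = soup.select(
    'div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]'
)

# urljoin resolves relative hrefs against BASE_URL and leaves
# absolute URLs unchanged.
pdf_urls = [urljoin(BASE_URL, a["href"]) for a in pdf_links]


def download_pdf(url, timeout=30):
    """Save one PDF to the current directory and return its file name."""
    filename = url.split("/")[-1]
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    with open(filename, "wb") as fh:
        fh.write(response.content)
    return filename


print(by_tag_name)
print(pdf_urls)
```

To fetch the files, one could loop over pdf_urls and call download_pdf(url) for each; that call is left out here so the sketch runs without network access.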

Regarding "python - Scraping PDFs from a web page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60903209/
