python - Finding the right elements for scraping a website

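I am trying to scrape articles from only certain parts of this website (the ECB site, https://www.ecb.europa.eu/, used in the code below). More specifically, I am trying to scrape articles only from the Media sub-page and its sub-sub-pages Press releases; Governing Council decisions; Press conferences; Monetary policy accounts; Speeches; Interviews, and only those that are in English.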

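Based on some tutorials and other Stack Overflow answers, I managed to write code that scrapes absolutely everything from the website. My original idea was to scrape everything and then clean the output later in a data frame, but the website contains so much content that the script always gets stuck after a while.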

Getting the sub-links:

import requests
import re
from bs4 import BeautifulSoup
master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = [ ] 
sub_links = {}
for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)
    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []
    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t"+sub_href)

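One thing I tried was changing the base link to a sub-link. My idea was that I could do this separately for each sub-page and then combine the links afterwards, but this did not work. Another thing I tried was replacing line 17 (the sub_atags assignment) with the following: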

sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)

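This seemed to partly solve my problem: although it did not restrict the links to the sub-pages, it at least ignored links that are not "doc-title", which are all the links on the website that carry text. It still returned too many links, however, and some links were not retrieved correctly.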

I also tried the following:

for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
    print(master_href)

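I thought that, because every href pointing to an English document contains .en, this would give me only the links with .en somewhere in the href. Instead, this code gives me a syntax error at print(master_href), which I don't understand because print(master_href) worked before.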
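On the syntax error: the bracketed expression on the line before print is incomplete (the opening [ is never closed and the conditional has no else), so Python only notices the problem when it reaches the next token, print. A minimal sketch of the intended filter follows. The helper name english_links and the choice to match '.en.' rather than '.en' are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ecb.europa.eu"


def english_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for the hrefs that carry the '.en.' language marker."""
    # Matching '.en.' (with the trailing dot) avoids accidental hits such as '.env'.
    return [urljoin(base_url, href) for href in hrefs if ".en." in href]


if __name__ == "__main__":
    sample = [
        "/press/pr/html/index.en.html",
        "/press/pr/html/index.de.html",
    ]
    print(english_links(sample))
```

Using urljoin instead of string concatenation also keeps hrefs that are already absolute from being prefixed a second time.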

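Next, I want to extract the following information from the sub-links. This part of the code works when I test it on a single link, but I never had a chance to try it on the output of the code above, because that code never finishes running. Will this work once I manage to get a correct list of all the links?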

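import pandas as pd  # requests, re and BeautifulSoup are imported in the first snippet

# Raw string so the backslashes in the pattern are not treated as string escapes
date_re = r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})'
rows = []
for link in sublinks:  # sublinks: the list of article URLs collected earlier
    resp = requests.get(link)  # request the single link, not the whole list
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    if article is None:  # skip pages without an <article> element
        continue
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall(date_re, str(textdate))
    datadate = matches[-1][0] if matches else None  # keep the last full-date match
    rows.append({
        "Article": article.get_text(),
        "Title": title.get_text() if title else None,
        "Text": "\n".join(p.get_text() for p in paragraphs),
        "date": datadate,
    })
ecbdf = pd.DataFrame(rows)  # one row per article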

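Coming back to the scraping itself: since the first approach with Beautiful Soup did not work for me, I also tried to tackle the problem in a different way. The website has RSS feeds, so I wanted to use the following code: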

import feedparser
import pandas as pd

rss_url = 'https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize import
df_ecb_feed = pd.json_normalize(ecb_feed.entries)
df_ecb_feed.head()

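Here I ran into the problem that I could not even find the RSS feed URL. I looked at the page source, searched for "RSS", and tried every URL I could find that way, but I always got an empty data frame.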

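I am a beginner at web scraping, and at this point I don't know how to proceed or how to solve this problem. In the end, what I want to accomplish is to collect all articles from the sub-pages, together with their titles, dates and authors, and put them into one data frame.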

Best answer

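The biggest problem you face when scraping this site is probably lazy loading: using JavaScript, they load the articles from several HTML pages and merge them into the list. For details, look for index_include in the page source. This is a problem for scraping with only requests and BeautifulSoup, because what your soup instance gets from the request content is just the basic skeleton, without the list of articles. You now have two options: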

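1. Use the lazily loaded article lists, e.g. /press/pr/date/2019/html/index_include.en.html, instead of the main article list pages (Press releases, Interviews, etc.). This is probably the easier option, but you have to do it for every year you are interested in.
2. Use a client that can execute JavaScript, such as Selenium, to obtain the HTML instead of requests.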
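
The first option can be sketched roughly as follows. Only the 2019 press-release path comes from the answer; the function names, the section parameter, and the assumption that other years follow the same path pattern are illustrative and should be checked against the live site. The span.doc-title selector is borrowed from the Selenium example further down and may not match the markup of the include pages.

```python
BASE_URL = "https://www.ecb.europa.eu"


def include_urls(years, section="pr", base_url=BASE_URL):
    """Build one index_include URL per year (path pattern assumed stable across years)."""
    return [
        f"{base_url}/press/{section}/date/{year}/html/index_include.en.html"
        for year in years
    ]


def article_links(url):
    """Fetch one include page and return the article hrefs found in it."""
    # Third-party imports are kept local so include_urls works without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # Selector is an assumption; inspect the include page to confirm it.
    return [a["href"] for a in soup.select("span.doc-title > a[href]")]
```

Calling article_links on each URL returned by include_urls(range(2018, 2020)) would then give the article links for 2018 and 2019, provided the assumptions above hold.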

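Apart from that, I would suggest using CSS selectors to extract the information from the HTML. That way you need only a few lines to get the article content. Also, if you use the index.en.html pages for scraping, I don't think you have to filter for English articles, because those pages show English by default and list other languages in addition, where available.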

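Here is an example I put together quickly. It can certainly be optimized, but it shows how to load the pages with Selenium and extract the article URLs and the article contents: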

from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]
driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')

I get the following output for the press releases page:

title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board                         
date: 20 December 2019                                    
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...

title: Monetary policy decisions                          
date: 12 December 2019                                    
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...
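
To get from the printed fields to the data frame the question asks for, the scraped strings can be cleaned into records first. The sketch below is only an illustration under stated assumptions: make_record is a hypothetical helper, the sample strings are placeholders, and the '%d %B %Y' format is inferred from dates such as "20 December 2019" in the output above.

```python
from datetime import datetime


def make_record(title, date_text, content):
    """Strip surrounding whitespace and parse a date like '20 December 2019'."""
    return {
        "title": title.strip(),
        # %B expects an English month name under the default locale.
        "date": datetime.strptime(date_text.strip(), "%d %B %Y").date(),
        "content": content.strip(),
    }


records = [
    make_record("Monetary policy decisions  ", "12 December 2019  ", "example text"),
]
print(records[0]["date"])  # 2019-12-12
```

A list of such records can then be passed to pandas.DataFrame(records) to build the table.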

A similar question to "python - Finding the right elements for scraping a website" can be found on Stack Overflow: https://stackoverflow.com/questions/59773096/
