python - Extracting text between two different tags with Beautiful Soup

Tags: python html python-3.x web-scraping beautifulsoup

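I'm trying to extract the text of the article from this web page.

I only want the article content, not the "About the Author" section.

The problem is that the article content isn't wrapped in a dedicated tag such as a <div>, so I can't isolate it: everything sits inside <p> tags. When I extract all the <p> tags, I also get the "About the Author" section. I have to scrape many pages from this site. Is there a way to do this with Beautiful Soup?

This is what I'm currently trying: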

p_tags=soup.find_all('p')
for row in p_tags:
    print(row)

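Best answer

All the paragraphs you want are inside the <div class="td-post-content"> tag, together with the paragraphs that hold the author information. However, the <p> tags you want are direct children of this <div>, whereas the unwanted <p> tags are not direct children (they are nested inside other div tags).

So you can pass recursive=False to find_all() to match only the direct children.

Code: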
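To see the effect of recursive=False in isolation, here is a minimal sketch that runs without network access. The markup is a made-up stand-in for the page layout described above (the author-box class name is an assumption, not taken from the real site), and it uses the built-in html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the layout described above:
# article paragraphs are direct children of the container, while the
# author paragraph is nested one level deeper in another div.
html = """
<div class="td-post-content">
  <p>First article paragraph.</p>
  <p>Second article paragraph.</p>
  <div class="author-box">
    <p>About the author.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="td-post-content")

# Default search (recursive=True) descends into nested tags.
all_paras = [p.get_text() for p in container.find_all("p")]

# recursive=False only considers direct children of the container.
direct_paras = [p.get_text() for p in container.find_all("p", recursive=False)]

print(all_paras)
print(direct_paras)
```

The first list contains all three paragraphs, while the second contains only the two article paragraphs.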

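import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/', headers=headers)
r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(r.text, 'lxml')

# The article body and the author box both live inside this container
container = soup.find('div', class_='td-post-content')
# recursive=False keeps only the <p> tags that are direct children
for para in container.find_all('p', recursive=False):
    print(para.text)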

Output:

Cybersecurity giant McAfee released its McAfee Labs Threat Report: June 2018 on Wednesday, outlining the growth and trends of new malware and cyber threats in Q1 2018. According to the report, coin mining malware saw a 623 percent growth in the first quarter of 2018, infecting 2.9 million machines in that period. McAfee Labs counted 313 publicly disclosed security incidents in the first three months of 2018, a 41 percent increase over the previous quarter. In particular, incidents in the healthcare sector rose 57 percent, with a significant portion involving Bitcoin-based ransomware that healthcare institutions were often compelled to pay.
Chief Scientist at McAfee Raj Samani said, “There were new revelations this quarter concerning complex nation-state cyber-attack campaigns targeting users and enterprise systems worldwide. Bad actors demonstrated a remarkable level of technical agility and innovation in tools and tactics. Criminals continued to adopt cryptocurrency mining to easily monetize their criminal activity.”
Sizeable criminal organizations are responsible for many of the attacks in recent months. In January, malware dubbed Golden Dragon attacked organizations putting together the Pyeongchang Winter Olympics in South Korea, using a malicious word attachment to install a script that would encrypt and send stolen data to an attacker’s command center. The Lazarus cybercrime ring launched a highly sophisticated Bitcoin phishing campaign called HaoBao that targeted global financial organizations, sending an email attachment that would scan for Bitcoin activity, credentials and mining data.
Chief Technology Officer at McAfee Steve Grobman said, “Cybercriminals will gravitate to criminal activity that maximizes their profit. In recent quarters we have seen a shift to ransomware from data-theft,  as ransomware is a more efficient crime. With the rise in value of cryptocurrencies, the market forces are driving criminals to crypto-jacking and the theft of cryptocurrency. Cybercrime is a business, and market forces will continue to shape where adversaries focus their efforts.”
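Since the question mentions scraping many pages, the same idea could be wrapped in a reusable function. The sketch below is one possible variant, not part of the original answer: it uses the CSS child combinator (>) with select(), which also matches only direct children. The function name extract_article_text, its container_class parameter, and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, container_class="td-post-content"):
    """Return the text of <p> tags that are direct children of the container.

    Hypothetical helper: the default class name comes from the answer above,
    everything else is an illustrative assumption.
    """
    soup = BeautifulSoup(html, "html.parser")
    # "div.cls > p" selects only <p> elements whose parent is that div.
    paras = soup.select(f"div.{container_class} > p")
    return "\n".join(p.get_text(strip=True) for p in paras)


# Made-up sample page standing in for a downloaded article.
sample = """
<div class="td-post-content">
  <p>First.</p>
  <div class="author-box"><p>About the author.</p></div>
  <p>Second.</p>
</div>
"""

result = extract_article_text(sample)
print(result)
```

For real pages, the HTML passed in would be the r.text of each response; if the container is missing, the function returns an empty string rather than raising.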

Source: a similar question on Stack Overflow about extracting text between two different tags with Beautiful Soup: https://stackoverflow.com/questions/51120802/
