python - Getting only the main body text from an HTML document in Python

Tags: python html parsing beautifulsoup

I have a URL (one of several) that I want to parse to get the body text of the article. I was able to parse it with the following code:

url = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"

import requests

url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")

for script in soup(["script", "style"]):
    script.extract()         
text = soup.get_text()
text.encode('ascii', 'ignore')

print(text)

The text I get back looks like this:

The Boeing Bear Wakens - The Boeing Company (NYSE:BA) | Seeking Alpha Marketplace Seeking Alpha Subscribe Portfolio My Portfolio All Portfolios + Create Portfolio Model Portfolios People News Analysis Sign In/Join Now Help Knowledge Base Feedback Forum Quick Picks & Lists | Industrials The Boeing Bear Wakens Jun. 9, 2019 6:30 AM ET || About: The Boeing Company (BA) By: Dhierin BechaiDhierin Bechai Aerospace, airlines, commercial aircraft Marketplace Aerospace Forum Summary Boeing production is temporarily reduced. Little is known about the duration of the reduction, but the decision to lower the production rate could be a sign of a prolonged grounding. The lower production rate adds to the downside for Boeing's share price. With the Boeing (NYSE:BA) 737 MAX fleet being grounded and deliveries to customers halted, Boeing is feeling the heat from two sides. While insurers have part of the damages covered,

It contains all the extra fragments, such as Subscribe, About, the timestamp, Join, and so on.

I need help with two things:

  1. Is there a generic way to parse only the article body, leaving out the other elements?
  2. Can I also get those additional elements separately, e.g. if I want to know how much social media traction the article has (likes, comments, shares on the different platforms)?

To check for generality, please try url2 (the second URL used in the answer below).

Thanks for your help, as always.

Best answer

You can pull the JSON out of a script tag and work with that:

url = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"

import requests
from bs4 import BeautifulSoup
import json

url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")

for script in soup(["script"]):
    if 'window.SA = ' in script.text:

        jsonStr = script.text.split('window.SA = ')[1]
        jsonStr = jsonStr.rsplit(';',1)[0]
        jsonObj = json.loads(jsonStr)

title = jsonObj['pageConfig']['Data']['article']['title']
print (title)
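Before moving on, you can check what other fields the extracted object exposes; a minimal sketch (the key path is the one used above, but the exact set of fields will vary by page):

# List the other fields available on the article object
# (only the ['pageConfig']['Data']['article'] path comes from the answer above;
#  which keys it contains depends on the page)
print(jsonObj['pageConfig']['Data']['article'].keys())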

There is a lot of information in there. And to get the article text:

# The article body sits in a div with itemprop="articleBody";
# its paragraphs carry the class "p p1"
article = soup.find('div', {'itemprop': 'articleBody'})
ps = article.find_all('p', {'class': 'p p1'})
for para in ps:
    print(para.text)

Output:

The Boeing Bear Wakens

Article:

With the Boeing (NYSE:BA) 737 MAX fleet being grounded and deliveries to customers being halted, Boeing is feeling the heat from two sides. While insurers have part of the damages covered, it is unlikely that a multi-month grounding will be fully covered. Initially, it seemed that Boeing was looking for a relatively fast fix to minimize disruptions as it was relatively quick with presenting a fix to stakeholders. Based on that quick roll-out, it seemed that Boeing was looking to have the fleet back in the air within 3 months. However, as the fix got delayed and Boeing and the FAA came under international scrutiny, it seems that timeline has slipped significantly as additional improvements are to be made. Initially, I expected that Boeing would be cleared to send the 737 MAX back to service in June/July, signalling a 3-4-month grounding and expected that Boeing's delivery target for the full year would decline by 40 units.



Source: Everett Herald
On the 5th of April, Boeing announced that it would be reducing the production rate for the Boeing 737 temporarily, which is a huge decision:
As we continue to work through these steps, we're adjusting the 737 production system temporarily to accommodate the pause in MAX deliveries, allowing us to prioritize additional resources to focus on software certification and returning the MAX to flight. We have decided to temporarily move from a production rate of 52 airplanes per month to 42 airplanes per month starting in mid-April.
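If you want the article as a single block of text rather than printing paragraph by paragraph, you can join the paragraphs collected above; a small sketch reusing the ps list from the article-extraction snippet:

# Collect the paragraphs into one string for further processing
article_text = "\n\n".join(para.text for para in ps)
print(article_text[:500])  # preview the first 500 characters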

You can also get the JSON response for the comments:

# The comments are served by a separate AJAX endpoint; a browser-like
# User-Agent header is needed so the request isn't blocked
url = 'https://seekingalpha.com/account/ajax_get_comments?id=4253393&type=Article&commentType=topLiked'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

jsonObj_comments = requests.get(url, headers=headers).json()
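That response should also help with the second part of the question (how much traction the article has). The structure of the payload isn't documented here, so the 'comments' key below is an assumption; inspect the keys first and adjust:

# Inspect the top-level structure first; the 'comments' key is an assumption
# and may need adjusting after looking at the printed keys
print(jsonObj_comments.keys())

comments = jsonObj_comments.get('comments', [])
print('number of comments returned:', len(comments))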

As far as a generic approach goes, that will be difficult, since every website has its own structure, format, use of tags and attribute names, and so on. However, I did notice that both of the sites you provided use <p> tags for their articles, so I suppose you could pull the text from those. With a generic approach, though, you will get a generic output, meaning you may end up with too much text, or with parts of the article missing.

import requests
from bs4 import BeautifulSoup

url1 = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"
url2 = "https://www.dqindia.com/accenture-helps-del-monte-foods-unlock-innovation-drive-business-growth-cloud/"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url1, headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")

# Grab every <p> tag on the page and print its text
paragraphs = soup.find_all('p')

for p in paragraphs:
    print(p.text)
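If you want to apply the same generic approach to both URLs (or any others), you could wrap it in a small helper; a sketch along those lines, reusing headers, url1, and url2 from the snippet above, with the same caveat that the output will include some non-article text:

def get_paragraph_text(page_url):
    # Fetch the page and return the text of all <p> tags joined together
    resp = requests.get(page_url, headers=headers)
    page_soup = BeautifulSoup(resp.text, "html.parser")
    return "\n\n".join(p.text for p in page_soup.find_all('p'))

for page_url in (url1, url2):
    print(get_paragraph_text(page_url)[:300])  # preview the first 300 characters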

Regarding python - getting only the main body text from an HTML document in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55647657/
