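Python: print/get the first sentence of each paragraph

Tags: python text beautifulsoup

Here is my code, but it prints the whole paragraph. How can I print only the first sentence, up to the first period?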

from bs4 import BeautifulSoup
import urllib.request

article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'

# Send a browser-like User-agent header with the request
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html, 'lxml')

def print_intro():
    # Print the first <p> only if its text is longer than 100 characters
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

print_intro()

This code prints:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.

But I only want to print:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.

Thanks for the help.

Best answer

Split the text on that period. For a single split, str.partition() is faster than str.split() with a limit:

text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
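
To make the comparison concrete, here is a minimal sketch of both approaches on a made-up sample string (the string and variable names are illustrative, not from the question). It also shows one possible refinement: concatenating the separator returned by partition() instead of a literal '.', so that no period is appended when the text contains none.

```python
text = "First sentence. Second sentence. Third."

# str.partition() splits at the first occurrence only and
# always returns a 3-tuple: (head, separator, tail).
head, sep, tail = text.partition('.')
first_by_partition = head + sep

# str.split() with maxsplit=1 yields the same head, but as a list.
first_by_split = text.split('.', 1)[0] + '.'

print(first_by_partition)
print(first_by_split)

# If there is no period, partition() returns the whole string as the head
# and an empty separator, so head + sep does not gain a stray '.'.
no_dot = "No period here"
head, sep, tail = no_dot.partition('.')
without_stray_dot = head + sep
print(without_stray_dot)
```

Both approaches cut at the first '.' character, so text such as "e.g." or a decimal number would end the "sentence" early.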

If you only need to process the first <p> element, use soup.find() instead:

text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
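
As a small illustration of how the two calls relate, the sketch below parses an inline HTML snippet (made up for this example) with the built-in 'html.parser' rather than fetching the article. It also shows that find() returns None when nothing matches, which may be worth guarding against.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML so the example runs without network access
html = "<div><p>First paragraph. More text.</p><p>Second paragraph.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match as a list
first_via_find = soup.find("p")
first_via_find_all = soup.find_all("p")[0]
same_text = first_via_find.get_text() == first_via_find_all.get_text()
print(same_text)

# With no match, find() returns None, whereas find_all(...)[0]
# would raise IndexError on the empty result list.
missing = soup.find("table")
missing_text = missing.get_text() if missing is not None else ""
print(repr(missing_text))
```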

However, for the URL you gave, the sample text is in the second paragraph:

>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
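
Since the title asks about each paragraph, the same idea can be applied in a loop. The following is only a sketch under stated assumptions: first_sentences() and its min_length parameter are hypothetical names, the HTML is a made-up stand-in for the article, and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup, including a drop-cap span
SAMPLE_HTML = """
<article>
  <p>Short teaser.</p>
  <p><span class="drop-cap">T</span>o state a claim would be uncontroversial. The brain is unique.</p>
  <p>No period in this one</p>
</article>
"""

def first_sentences(soup, min_length=0):
    """Return the first sentence of every <p> whose text is longer than min_length."""
    results = []
    for p in soup.find_all("p"):
        # get_text() joins the text of nested tags, so the drop cap is included
        text = p.get_text().strip()
        if len(text) <= min_length:
            continue
        # Cut at the first period; sep is empty if the paragraph has none
        head, sep, _ = text.partition(".")
        results.append(head + sep)
    return results

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sentences = first_sentences(soup)
for sentence in sentences:
    print(sentence)
```

Passing min_length=100 would mirror the length check in the question, skipping short paragraphs such as captions or bylines.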

Regarding "Python: print/get the first sentence of each paragraph", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35292754/
