python - Extracting text with line breaks in BeautifulSoup

Tags: python web beautifulsoup

I want to use BeautifulSoup to extract text, with each "br" tag turned into a line break.

html = '<td class="s4 softmerge" dir="ltr"><div class="softmerge-inner" style="width: 5524px; left: -1px;">But when he saw many of the Pharisees and Sadducees come to his baptism, he said unto them, <br/>O generation of vipers, who hath warned you to flee from the wrath to come?<br/>Bring forth therefore fruits meet for repentance:<br/>And think not to say within yourselves, We have Abraham to our father: for I say unto you, that God is able of these stones to raise up children unto Abraham.<br/>And now also the axe is laid unto the root of the trees: therefore every tree which bringeth not forth good fruit is hewn down, and cast into the fire.<br/>I indeed baptize you with water unto repentance. but he that cometh after me is mightier than I, whose shoes I am not worthy to bear: he shall baptize you with the Holy Ghost, and with fire:<br/>Whose fan is in his hand, and he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.</div></td>'

I want to get this result as a string:

But when he saw many of the Pharisees and Sadducees come to his baptism, he said unto them,
O generation of vipers, who hath warned you to flee from the wrath to come?
Bring forth therefore fruits meet for repentance:
And think not to say within yourselves, We have Abraham to our father: for I say unto you, that God is able of these stones to raise up children unto Abraham.
And now also the axe is laid unto the root of the trees: therefore every tree which bringeth not forth good fruit is hewn down, and cast into the fire.
I indeed baptize you with water unto repentance. but he that cometh after me is mightier than I, whose shoes I am not worthy to bear: he shall baptize you with the Holy Ghost, and with fire:
Whose fan is in his hand, and he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.

How can I code this to get that result?

Best answer

Sorry if this isn't what you're looking for, but you could try replacing the tags with a regex.

For example, you can create a regular-expression filter that finds every <br> tag and replaces it with a newline (\n).

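If you are starting from a BeautifulSoup object, you first need its markup as a string. I believe str() gives you that: html = str(soupelement). Note that the string attribute would not help here, because it is None for an element with several children, as in the question.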

import re

# The filter: matches <br>, <br/> and <br />, in any letter case
regex = re.compile(r"<br\s*/?>", re.IGNORECASE)
html = 'blah blah b<br>lah <br/> bl<br/>'
newtext = regex.sub('\n', html)  # replace every match with a newline
print(newtext)
# newtext is now 'blah blah b\nlah \n bl\n'
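
If the markup is already parsed, the same idea can be expressed without a regex. The following is only a sketch, not part of the original answer: it assumes the bs4 package is installed, uses the built-in "html.parser" backend, and runs on a shortened stand-in for the question's markup. It swaps each <br> element for a newline text node and then reads the text back out.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question.
html = (
    '<td class="s4 softmerge" dir="ltr">'
    '<div class="softmerge-inner">'
    'he said unto them, <br/>O generation of vipers<br/>Bring forth fruits'
    '</div></td>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="softmerge-inner")

# Swap every <br> element for a literal newline text node.
for br in div.find_all("br"):
    br.replace_with("\n")

# get_text() concatenates the text nodes, newlines included;
# stripping each line drops the space left in front of a <br>.
text = "\n".join(line.strip() for line in div.get_text().split("\n"))
print(text)
```

A shorter variant that may also work is div.get_text(separator="\n", strip=True), but it puts the separator between all text nodes, so inline tags such as <b> would start new lines as well.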

This question and answer are based on a similar question on Stack Overflow: https://stackoverflow.com/questions/53145500/
