python - How do I extract the text inside a tag with BeautifulSoup in Python?

Suppose I have an HTML string like this:

<html>
    <div id="d1">
        Text 1
    </div>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
    <div id="d3">
        Text 3
    </div>
</html>

I want to extract the content of d2 that is NOT wrapped in another tag, skipping the link. In other words, I want this result:

Text 2
Text 2 continue

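Is there a way to do this with BeautifulSoup?

I tried this, but the result is not right:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc is the HTML string above
s = soup.find(id='d2').text  # .text also includes the link text "a url"
print(s)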

Best answer

Try .find_all(text=True, recursive=False), which returns only the text nodes that are direct children of the tag:

from bs4 import BeautifulSoup
div_test="""
<html>
    <div id="d1">
        Text 1
    </div>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
    <div id="d3">
        Text 3
    </div>
</html>
"""
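soup = BeautifulSoup(div_test, 'lxml')  # the 'lxml' parser requires the lxml package
# text=True matches only text nodes; recursive=False limits the search to direct children
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s])  # strip surrounding whitespace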

It returns a list containing only the text (the u prefixes show that this output came from Python 2):

[u'\n        Text 2\n        ', u'\n        Text 2 continue\n    ']
[u'Text 2', u'Text 2 continue']
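
The same idea can also be written without find_all by walking the tag's direct children and keeping only the text nodes. The following is a minimal sketch for Python 3, not part of the original answer: the names HTML_DOC and direct_text are hypothetical, and it uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Same markup as in the question (HTML_DOC is an illustrative name).
HTML_DOC = """
<html>
    <div id="d1">
        Text 1
    </div>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
    <div id="d3">
        Text 3
    </div>
</html>
"""


def direct_text(tag):
    """Return the stripped text nodes that are direct children of `tag`.

    Hypothetical helper for illustration. Note that HTML comments are also
    NavigableString subclasses, so they would be included if present.
    """
    # tag.children yields direct children only: nested tags such as <a>
    # are Tag objects, while bare text is a NavigableString.
    pieces = (child.strip() for child in tag.children
              if isinstance(child, NavigableString))
    # Drop nodes that were whitespace only.
    return [p for p in pieces if p]


soup = BeautifulSoup(HTML_DOC, "html.parser")
lines = direct_text(soup.find(id="d2"))
print("\n".join(lines))
```

Joining the list with newlines prints the two lines the question asks for, with the link text left out.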

A similar question about how to extract the text inside a tag with BeautifulSoup in Python can be found on Stack Overflow: https://stackoverflow.com/questions/44858226/
