python - Extracting tag content based on content value using BeautifulSoup

Tags: python beautifulsoup html-content-extraction

I have an HTML document in the following format.

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

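I want to extract the content of the paragraph tag, including the content of the italic and bold tags, but excluding the content of the anchor tag. The number at the beginning can also be ignored.

The expected output is: Content of the paragraph in italic but not strong.

What is the best way to do this?

Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable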

soup = BSoup(page)
for p in soup.findAll('p'):
    if '&nbsp;&nbsp;&nbsp;' in p.string:
        print p

Thanks for your suggestions.

Accepted answer

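Your code fails because tag.string is set only when the tag has exactly one child and that child is a NavigableString. The paragraph here has several children (text plus <i>, <b>, and <a> tags), so p.string is None, and testing membership with "in" against None raises the TypeError.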
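The behaviour of .string can be checked with a short sketch. It assumes BeautifulSoup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 API used in the question, and the two-paragraph input is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BS3 module from the question

# Illustrative input: one paragraph with mixed children, one with text only.
html = ('<p>&nbsp;&nbsp;&nbsp;1. Content <i> in italic </i></p>'
        '<p>plain text only</p>')
soup = BeautifulSoup(html, "html.parser")

mixed, plain = soup.find_all("p")

# Several children (a text node plus an <i> tag), so .string is None.
mixed_string = mixed.string

# Exactly one child and it is a NavigableString, so .string is that text.
plain_string = plain.string

# get_text() always returns a str, so the membership test cannot hit None.
# The parser decodes &nbsp; to the non-breaking space U+00A0.
has_prefix = "\xa0\xa0\xa0" in mixed.get_text()

print(mixed_string, plain_string, has_prefix)
```

Using get_text() (or guarding with "if p.string is not None") avoids the TypeError without changing which paragraphs are matched.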

You can get what you want by extracting the <a> tags from each paragraph before collecting its text:

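# BeautifulSoup 3 on Python 2
from BeautifulSoup import BeautifulSoup

s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
# convertEntities turns &nbsp; into the actual non-breaking space character
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

for p in soup.findAll('p'):
    # remove every anchor from the paragraph, then join the remaining text nodes
    for a in p.findAll('a'):
        a.extract()
    print ''.join(p.findAll(text=True))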
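For readers on Python 3, the same approach can be sketched with BeautifulSoup 4. The bs4 import, the helper name paragraph_text, and the regular expressions that drop the leading number and tidy whitespace are assumptions added here, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 on Python 3

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')


def paragraph_text(p):
    """Hypothetical helper: text of a <p> tag without its <a> descendants."""
    for a in p.find_all("a"):
        a.extract()  # detach the anchor (and its text) from the tree
    text = p.get_text()
    # Drop a leading "1. " style number; \s also matches U+00A0 (&nbsp;).
    text = re.sub(r"^\s*\d+\.\s*", "", text)
    # Collapse runs of whitespace left behind by the inline tags.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space that remains before punctuation after the anchor is gone.
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


soup = BeautifulSoup(html, "html.parser")
results = [paragraph_text(p) for p in soup.find_all("p")]
print(results)
```

On the sample paragraph this should give the single string "Content of the paragraph in italic but not strong.". Note that extract() modifies the parsed tree, so re-parse the HTML if the anchors are needed later.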

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/8909481/
