python - BeautifulSoup: exclude content inside a specific tag

Tags: python html beautifulsoup html-parsing lxml

I have the following code to find the text in a paragraph:

soup.find("td", { "id" : "overview-top" }).find("p", { "itemprop" : "description" }).text

How can I exclude all text inside the <a> tag? Something like "in <p> but not in <a>"?

Best answer

Find all text nodes inside the p tag and join them, keeping only those whose parent is not an a tag:

p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Keep only text nodes whose direct parent is not an <a> tag
print(''.join(text for text in p.find_all(text=True)
              if text.parent.name != "a"))

Demo (note that the link text is not printed):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <td id="overview-top">
...     <p itemprop="description">
...         text1
...         <a href="google.com">link text</a>
...         text2
...     </p>
... </td>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})
>>> print(p.text)

        text1
        link text
        text2
>>>
>>> print(''.join(text for text in p.find_all(text=True) if text.parent.name != "a"))

        text1

        text2
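
The check on text.parent only looks at the direct parent, so text nested more deeply inside a link (for example <a><b>link text</b></a>) would still be printed. The following is a minimal sketch of one way to handle that case, not part of the original answer: the helper name text_outside and the modified sample markup are hypothetical, and it assumes Beautiful Soup 4.4.0 or later, where string=True is the newer spelling of text=True.

```python
from bs4 import BeautifulSoup


def text_outside(tag, excluded_name):
    """Join the text nodes under `tag` that have no ancestor named `excluded_name`.

    Hypothetical helper for illustration; it checks every ancestor,
    not just the direct parent.
    """
    return "".join(
        node
        for node in tag.find_all(string=True)
        if node.find_parent(excluded_name) is None
    )


# Sample markup (assumed): the link text is wrapped in an extra <b> tag.
html = """
<td id="overview-top">
    <p itemprop="description">
        text1
        <a href="google.com"><b>link text</b></a>
        text2
    </p>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

result = text_outside(p, "a")
print(result.split())
```

Because find_parent walks all the way up the tree, this would also drop everything if the p tag itself sat inside an a tag, which may or may not be what you want.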

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/27610221/