python - Extracting the text between HTML comments with BeautifulSoup

Tags: python python-3.x web-scraping beautifulsoup

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page that is identified only by the comment directly above it. An example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

I have found various ways to extract a page's text or its comments, but none that does what I want. Any help would be greatly appreciated.

Best answer

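You just need to iterate over all the comments in the document, check whether each one is one of the entries you need, and then display the text of the element that follows it, like so: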

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

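# Find every comment node, then print the text node that directly follows each wanted comment.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        print(comment.next_element.strip())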

This displays the following:

I would like to get this text
I would also like to find this text
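
Note that next_element returns only the single node that comes right after the comment. If the text after a comment could contain inline tags (for example <b>), one possible variation is to keep walking next_elements until the next comment is reached. The following is a minimal sketch under that assumption, not part of the original answer; the helper name text_after_comments and the sample markup are hypothetical, and it uses the built-in 'html.parser' so that lxml is not required:

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def text_after_comments(html, wanted):
    """Map each wanted comment to the text between it and the next comment.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        label = comment.strip()
        if label not in wanted:
            continue
        parts = []
        # Walk forward in document order until the next comment starts.
        for node in comment.next_elements:
            if isinstance(node, Comment):
                break
            # Tags are skipped here; their text arrives as separate string nodes.
            if isinstance(node, NavigableString):
                text = node.strip()
                if text:
                    parts.append(text)
        result[label] = ' '.join(parts)
    return result


sample = """
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get <b>this</b> text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
"""

for label, text in text_after_comments(sample, {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'}).items():
    print(label, '->', text)
```

The pieces are joined with single spaces, so the original whitespace around inline tags is not preserved exactly; adjust the joining if that matters for your page.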

This answer comes from a similar question on Stack Overflow about extracting the text between HTML comments with BeautifulSoup: https://stackoverflow.com/questions/34673851/
