python - Finding several different strings in BeautifulSoup and returning the containing tags

Suppose I have the following HTML:

<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>

I want to find every tag that contains all of the keywords I am searching for. For example (examples 2 and 3 will not work as written):

>>> len(soup.find_all(text="world"))
2

>>> len(soup.find_all(text="world puzzle"))
1

>>> len(soup.find_all(text="world puzzle book"))
0

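I have been trying to come up with a regular expression that lets me search for all of the keywords, but it seems that ANDing them is not possible (only ORing).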

Thanks in advance!

Best answer

The simplest way to do this kind of complex matching is to write a function that performs the match, and pass that function in as the value of the text argument.

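def must_contain_all(*strings):
    # Build a filter that accepts a string only if it contains every keyword.
    def must_contain(markup):
        # markup can be None when a tag has no single string child.
        return markup is not None and all(s in markup for s in strings)
    return must_contain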

Now you can get the matching strings:

print(soup.find_all(text=must_contain_all("world", "puzzle")))
# ["\nWho in the world am I? Ah, that's the great puzzle.\n"]

To get the tags that contain those strings, use the .parent attribute:

print([text.parent for text in soup.find_all(text=must_contain_all("world", "puzzle"))])
# [<p>Who in the world am I? Ah, that's the great puzzle.</p>]  (whitespace inside the tag omitted)
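
Putting the pieces together, the following is a minimal, self-contained sketch for Python 3 rather than a definitive implementation. It assumes the bs4 package is installed and uses the built-in "html.parser". The names HTML, tags_containing_all, counts, pattern, and regex_hits are illustrative and do not come from the answer above. It passes string= instead of text=, which is the newer name for the same argument (Beautiful Soup 4.4.0 and later). The last part shows one possible way to AND keywords in a single regular expression using lookaheads; this is an alternative, not the method the answer proposes.

```python
import re

from bs4 import BeautifulSoup

# The sample markup from the question.
HTML = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""


def must_contain_all(*strings):
    # Filter factory from the answer: accept a string only if every keyword occurs in it.
    def must_contain(markup):
        return markup is not None and all(s in markup for s in strings)
    return must_contain


def tags_containing_all(soup, *keywords):
    # Hypothetical helper: return the parent tag of each matching string.
    matches = soup.find_all(string=must_contain_all(*keywords))
    return [node.parent for node in matches]


soup = BeautifulSoup(HTML, "html.parser")

# The three queries from the question, split into individual keywords.
queries = ("world", "world puzzle", "world puzzle book")
counts = [len(tags_containing_all(soup, *q.split())) for q in queries]
print(counts)  # [2, 1, 0]

# Alternative: one lookahead per keyword, so all of them must be present.
# re.DOTALL lets "." cross the newlines that surround the text.
pattern = re.compile(r"^(?=.*world)(?=.*puzzle)", re.DOTALL)
regex_hits = soup.find_all(string=pattern)
print(len(regex_hits))  # 1
```

Both approaches do plain substring matching, so "world" would also match inside a longer word such as "worldwide". If whole-word matching matters, the filter function is the easier place to add it.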

Regarding python - Finding several different strings in BeautifulSoup and returning the containing tags, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11678078/
