python - How do I find all kinds of website tags with Python and BeautifulSoup?

I want to inspect the inner HTML text content of every tag.

For example:

<a>
    Hello World
</a>
<div>
    Wow!
</div>

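I want to get "Hello World" and "Wow!".

I know I can use .findChildren(['a', 'div']). However, real websites contain many kinds of tags, such as "p", "td", and "tr", so I don't think .findChildren is an efficient way to solve this.

At the moment, I think recursion combined with .find_all_next() might help me solve this, but I don't know how to implement it. I'm also not sure whether the idea is workable at all.

Please give me some hints so I can find the answer!

Thank you very much for your help! :)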

Best answer

You can use .strings or .stripped_strings to extract the text inside the tags:

for string in soup.stripped_strings:
    print(repr(string))

From the documentation:

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead.
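To make the difference between the two generators concrete, here is a minimal sketch that runs both on the markup from the question. The variable names (html, raw_strings, clean_strings) and the choice of the built-in "html.parser" backend are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a>
    Hello World
</a>
<div>
    Wow!
</div>
"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node as-is, including the
# whitespace-only nodes that sit between tags.
raw_strings = list(soup.strings)

# .stripped_strings trims each string and skips the ones
# that consist only of whitespace.
clean_strings = list(soup.stripped_strings)

print(raw_strings)
print(clean_strings)
```

The second list contains only "Hello World" and "Wow!", which is the result the question asks for, without listing tag names up front.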

Alternatively, you can use the .get_text() method:

print(soup.get_text())
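Called without arguments, .get_text() keeps the original whitespace. The sketch below shows one possible way to get tidier output with its separator and strip parameters, and, as an additional assumption beyond the original answer, how find_all(True) could pair each tag name with its text. The names all_text and text_by_tag are hypothetical.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, written as a single string.
html = "<a>\n    Hello World\n</a>\n<div>\n    Wow!\n</div>"
soup = BeautifulSoup(html, "html.parser")

# One string for the whole document: strip=True trims each piece of text
# and drops empty ones, and the separator is placed between the pieces.
all_text = soup.get_text(separator=" | ", strip=True)

# find_all(True) matches every tag regardless of its name, so the
# tag names do not have to be listed in advance.
text_by_tag = [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(True)]

print(all_text)
print(text_by_tag)
```

Note that with nested markup, a parent tag's .get_text() also includes the text of its children, so the per-tag list can repeat text; for a flat list of strings, .stripped_strings is the simpler choice.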

A similar question to this one (python - how to find all kinds of website tags with Python and BeautifulSoup) can be found on Stack Overflow: https://stackoverflow.com/questions/35717028/
