python - BeautifulSoup: how to get nested divs

Given the following code:

<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>

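How can I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I searched the Internet but could not find a case with an easy-to-grasp example, so I set this one up. Thanks.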

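Best answer

XPath would be the straightforward answer, but BeautifulSoup does not support it.

Update: a BeautifulSoup solution

For this, assuming you know the class and the element (div) as in this example, you can use a for loop with attrs to get what you want: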

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

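# Name the parser explicitly so the result does not depend on what is installed.
content = BeautifulSoup(html, 'html.parser')

# find_all is the bs4 spelling of the older findAll.
for div in content.find_all('div', attrs={'class': 'category5'}):
    # Strip the surrounding whitespace and newline inside the div.
    print(div.text.strip())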

test
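
Since the question is specifically about nesting, the same lookup can also be written with CSS selectors, which let you spell out the path through the nested divs. This is only a sketch using a trimmed copy of the sample markup; the variable names (`inner`, `same`, `text`) are made up for illustration, and it assumes a bs4 version where `select_one` is available.

```python
from bs4 import BeautifulSoup

html = """
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: any div.category5 somewhere inside #foo.
inner = soup.select_one("#foo div.category5")

# Child selector: spell out each level of nesting explicitly.
same = soup.select_one(
    "div.category1 > div.category2 > div.category4 > div.category5"
)

# select_one returns None when nothing matches, so guard before reading text.
text = inner.get_text(strip=True) if inner is not None else None
print(text)
```

The descendant form is more tolerant of markup changes; the child form fails fast if an intermediate level goes missing.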

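I had no problem extracting the text from your HTML sample. As @MartijnPieters suggested, you need to find out why your div element is missing.

Another update

Because you are missing lxml as the parser for BeautifulSoup, None is returned: nothing was actually parsed to begin with. Installing lxml should solve your problem.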
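
If installing lxml is not an option, one possible workaround is to request lxml and fall back to Python's built-in parser when it is unavailable. The helper name `make_soup` is hypothetical, and this is a sketch rather than a fix confirmed for the asker's setup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, else use the stdlib parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(
    '<div class="category4"><div class="category5"> test</div></div>'
)
div = soup.find("div", class_="category5")

# find returns None when there is no match, so check before using it.
result = div.get_text(strip=True) if div is not None else None
print(result)
```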

You might consider using lxml or a similar tool that supports XPath; if you ask me, it is quite simple.

from lxml import etree

tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n                 ']
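
Note that `etree.fromstring` parses the input as XML, which works here only because the sample is well-formed. For real-world pages, a variant using lxml's HTML parser may be more forgiving. The sketch below uses a shortened copy of the markup, and stripping the whitespace is an addition of mine, not part of the original answer.

```python
from lxml import html as lxml_html

markup = """
<html><body>
<div class="category1" id="foo">
  <div class="category4">
    <div class="category5"> test
    </div>
  </div>
</div>
</body></html>
"""

# lxml.html tolerates markup that is not well-formed XML.
tree = lxml_html.fromstring(markup)

# Same XPath as above; text() returns the raw text nodes, whitespace included.
raw = tree.xpath('.//div[@class="category5"]/text()')

# Drop surrounding whitespace and any whitespace-only nodes.
cleaned = [t.strip() for t in raw if t.strip()]
print(cleaned)
```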

Regarding "python - BeautifulSoup: how to get nested divs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26627080/
