Python/BeautifulSoup - Extract div content by checking the h1 text

I have an HTML page like this:

<div class="class1">
   <div class="head">
      <h1 class="title">Title 1</h1>
   <div class="body">
<!-- some body content -->
   </div>
   </div>
</div>

<div class="class1">
   <div class="head">
      <h1 class="title">Title 2</h1>
   <div class="body">
<!-- some body content -->
   </div>
   </div>
</div>

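I need to extract the content of the div with class body only when the title equals "Title 2". Since the parent containers have no specific id or class, the h1 text is the only way to identify which div I want. Currently I use this code: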

from bs4 import BeautifulSoup

# code to open the webpage
soup = BeautifulSoup(data, 'lxml')
body_content = soup.findAll('div', {'class':'class1'})[1]

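But this is not very elegant, because it assumes the div I am interested in is always the second one on the page; it does not check the title.

Best answer

Well, the only solution I can think of is the following: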

soup = BeautifulSoup(html, "html.parser")
# Collect every container div, then keep the first one whose markup mentions the title
result_tags = soup.find_all(name='div', class_='class1')
body_content = [tag for tag in result_tags if 'Title 2' in tag.prettify()][0]

It is better than your original code because it does not assume your target div is the second one on the page.
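
A possibly stricter variant, offered only as a minimal sketch: instead of searching the rendered markup for a substring, it compares the h1 text exactly and then walks up to the enclosing class1 container to fetch the div with class body. The helper name body_for_title and the placeholder texts "Body 1" and "Body 2" are assumptions made for illustration; they are not part of the original question or answer.

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the structure in the question.
# "Body 1" / "Body 2" are hypothetical placeholders for the real body content.
html = """
<div class="class1">
   <div class="head">
      <h1 class="title">Title 1</h1>
   <div class="body">Body 1</div>
   </div>
</div>
<div class="class1">
   <div class="head">
      <h1 class="title">Title 2</h1>
   <div class="body">Body 2</div>
   </div>
</div>
"""


def body_for_title(soup, title):
    """Return the div.body that shares a class1 container with the matching h1.

    Hypothetical helper: returns None when no h1 has exactly this text.
    """
    for h1 in soup.find_all("h1", class_="title"):
        # Exact comparison, so "Title 2" does not also match "Title 20".
        if h1.get_text(strip=True) == title:
            container = h1.find_parent("div", class_="class1")
            if container is not None:
                return container.find("div", class_="body")
    return None


soup = BeautifulSoup(html, "html.parser")
body = body_for_title(soup, "Title 2")
if body is not None:
    print(body.get_text(strip=True))
```

Returning None for a missing title avoids the IndexError that indexing an empty list with [0] would raise, so the caller can decide how to handle a page without the expected heading.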

A similar question about "Python/BeautifulSoup - Extract div content by checking the h1 text" can be found on Stack Overflow: https://stackoverflow.com/questions/40883881/
