python - Parse HTML with BeautifulSoup based on the preceding tag

Tags: python html parsing html-parsing beautifulsoup

I have an HTML document in which each heading is followed by some marked-up text. Something like this:

<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>

<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>

<h1>Title 3</h1>
<p>Some text</p>
<p>Some other <i>text</i></p>

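(The only thing that is fixed is the number of headings; everything else can change.)

How can I use BeautifulSoup to extract all the HTML that comes after each heading but before the next one?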

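Best answer

You can pass a regular expression, Title \d+, as the text argument (renamed string in Beautiful Soup 4.4) to find all the headings, then use find_next_siblings() to get the next two p tags after each one: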

import re
from bs4 import BeautifulSoup

data = """
<div>
    <h1>Title 1</h1>
    <p>Some text</p>
    <p>Some other <b>text</b></p>

    <h1>Title 2</h1>
    <p>Some <b>text</b></p>
    <p>Some text2</p>

    <h1>Title 3</h1>
    <p>Some text</p>
    <p>Some other <i>text</i></p>
</div>
"""

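soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a bs4 warning

# string= was called text= before Beautiful Soup 4.4
for h1 in soup.find_all('h1', string=re.compile(r'Title \d+')):
    for p in h1.find_next_siblings('p', limit=2):
        print(p.text.strip())

Prints: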

Some text
Some other text
Some text
Some text2
Some text
Some other text

Alternatively, use a list comprehension:

print([p.text.strip()
       for h1 in soup.find_all('h1', string=re.compile(r'Title \d+'))
       for p in h1.find_next_siblings('p', limit=2)])

Prints:

['Some text', 'Some other text', 'Some text', 'Some text2', 'Some text', 'Some other text']
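The snippets above assume exactly two p tags after every heading and print only the text. Since the question says the content between headings can vary and asks for the HTML itself, here is a minimal sketch that walks each heading's following siblings and stops at the next heading. The helper name html_after_each_heading and the modified sample data are hypothetical, made up for illustration. The sketch also assumes the headings and their content are siblings under the same parent, and it skips bare text nodes between tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample data (an assumption for illustration): a different number of
# tags follows each heading.
data = """
<div>
    <h1>Title 1</h1>
    <p>Some text</p>
    <p>Some other <b>text</b></p>

    <h1>Title 2</h1>
    <p>Some <b>text</b></p>

    <h1>Title 3</h1>
    <p>Some text</p>
    <p>Some other <i>text</i></p>
    <p>One more</p>
</div>
"""


def html_after_each_heading(soup, heading_name="h1"):
    """Map each heading's text to the HTML of the tags that follow it,
    stopping at the next heading with the same tag name.

    Hypothetical helper, not part of Beautiful Soup.
    """
    sections = {}
    for heading in soup.find_all(heading_name):
        parts = []
        for sibling in heading.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip bare text nodes, e.g. whitespace between tags
            if sibling.name == heading_name:
                break  # reached the next heading: this section is done
            parts.append(str(sibling))  # str() keeps the markup, not just the text
        sections[heading.get_text(strip=True)] = parts
    return sections


soup = BeautifulSoup(data, "html.parser")
sections = html_after_each_heading(soup)
for title, parts in sections.items():
    print(title, parts)
```

If two headings share the same text, the later one overwrites the earlier one in the dictionary; returning a list of (heading, parts) pairs would avoid that.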

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/24966829/
