html - Getting content attribute-value pairs using BeautifulSoup or XPath

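For the XHTML snippet below, I need to extract attribute-value pairs from the structured HTML using BS4 or XPath. Each attribute name appears in an h5 tag, and its value is in the span or p tags that follow it.

For the code below, I should get the following output as a dictionary:

Husbandary Management: "Animal: Cow Farmer: Mr smith"

Milch Category: "Milk supply"

Services: "cow milk, ghee"

animal colors: 'red, green...'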

<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>

htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it, up to the next h5.

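Best answer

A sample solution using lxml.html and XPath:

Select all h5 elements.
For each h5 element,
1. select the following sibling elements -- following-sibling::*
2. that are not themselves h5 -- [not(self::h5)]
3. and whose number of preceding h5 siblings equals the position of the current h5 -- [count(preceding-sibling::h5) = 1], then 2, then 3...

(the for loop uses enumerate() starting from 1)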
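
The three predicates above can be checked in isolation on a much smaller fragment. The following is a minimal sketch, assuming lxml is installed; the two-section fragment and the helper name values_for are hypothetical and exist only for illustration.

```python
import lxml.html

# Hypothetical two-section fragment, used only to illustrate the predicates.
fragment = "<div><h5>A</h5><p>a1</p><p>a2</p><h5>B</h5><span>b1</span></div>"
doc = lxml.html.fromstring(fragment)


def values_for(header, position):
    """Return the stripped text of the non-h5 siblings that belong to the
    h5 at the given 1-based position (hypothetical helper)."""
    expr = (
        "following-sibling::*"                         # step 1: later siblings
        "[not(self::h5)]"                              # step 2: skip headers
        "[count(preceding-sibling::h5) = %d]" % position  # step 3: same section
    )
    return [el.text_content().strip() for el in header.xpath(expr)]


headers = doc.xpath("//div/h5")
first = values_for(headers[0], 1)
second = values_for(headers[1], 2)
print(first)   # ['a1', 'a2']
print(second)  # ['b1']
```

Without the count() predicate, the first header would also pick up b1, because following-sibling::* does not stop at the next h5 on its own.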

Sample code that simply prints each element's text content (using lxml.html's .text_content() on the elements):

import lxml.html
html = """<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>"""
doc = lxml.html.fromstring(html)
headers = doc.xpath('//div/h5')
for i, header in enumerate(headers, start=1):
    print("--------------------------------")
    # The attribute name is the text of the h5 itself
    print(header.text_content().strip())
    # The values are the non-h5 siblings that have exactly i h5 elements before them
    for following in header.xpath("""following-sibling::*
                                     [not(self::h5)]
                                     [count(preceding-sibling::h5) = %d]""" % i):
        print("\t", following.text_content().strip())

Output:

--------------------------------
Husbandary Management
    Animal: Cow
    Farmer: Mr smith
--------------------------------
Milch Category
    Milk supply
--------------------------------
Services
    cow milk, ghee
--------------------------------
animal colors
    green,red
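
The question asks for a dictionary and also mentions BeautifulSoup. As an alternative sketch that is not part of the original answer, the same grouping can be done by walking each h5's following siblings and stopping at the next h5. This assumes bs4 is installed and uses Python's built-in html.parser; the helper name pairs_from_headers is hypothetical, and the markup is a condensed copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

# Condensed copy of the snippet from the question.
html = """<div id="animalcontainer" class="container last fixed-height">
  <h5>Husbandary Management</h5>
  <span>Animal: Cow</span>
  <span>Farmer: Mr smith</span>
  <h5>Milch Category</h5>
  <p>Milk supply</p>
  <h5>Services</h5>
  <p>cow milk, ghee</p>
  <h5>animal colors</h5>
  <span>green,red</span>
</div>"""


def pairs_from_headers(markup):
    """Map each h5 heading to the text of the sibling tags that follow it,
    stopping at the next h5 (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for header in soup.find_all("h5"):
        values = []
        # find_next_siblings() with no arguments yields the following sibling tags
        for sibling in header.find_next_siblings():
            if sibling.name == "h5":
                break  # the next attribute name starts here
            values.append(sibling.get_text(strip=True))
        result[header.get_text(strip=True)] = " ".join(values)
    return result


pairs = pairs_from_headers(html)
for name, value in pairs.items():
    print("%s: %r" % (name, value))
```

Multiple values under one heading are joined with a single space here to match the format requested in the question; keeping them as a list would be an equally reasonable choice.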

Regarding "html - Getting content attribute-value pairs using BeautifulSoup or XPath", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23215028/
