python - Removing XML elements based on a match

I would like to remove elements based on a match in their sub-elements.
Example file.xml:

 <entry>
  <title>TEST1</title>
  <profile>
    <title>Default</title>
    <pid>
      <pidNumber>1880</pidNumber>
      <ContentType>PMT</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>201</pidNumber>
      <ContentType>Video</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>301</pidNumber>
      <ContentType>Audio</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>302</pidNumber>
      <ContentType>Audio</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>310</pidNumber>
      <ContentType>Audio</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
  </profile>
</entry>

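As you can see, there are several pid values (201, 301, 302-310). I want to remove all the pids whose number falls in the range 302-310. Here is my code, but it raises an error.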

# -*- coding: utf-8 -*-
import re
from xml.etree import ElementTree as ET

root = ET.parse("file.xml").getroot()
regex = r"[3][0-1][02-9]"
getpid = root.iter("pid")

for item in getpid:
    pidnum = item.find('.//pidNumber')
    pidnum = pidnum.text
    match = re.findall(regex, pidnum)
    match = ''.join(match)
    if pidnum == match:
        ET.dump(item)
        item.remove(getpid)

tree = ET(root)
tree.write("out.xml")

The error I get:

self._children.remove(element)
ValueError: list.remove(x): x not in list

How can I fix this? I think I'm close.
Thanks for looking and for any help.

Best answer

I want to remove all the pids that matches from 302-310.

I think the logic of your regex is flawed. If a pidNumber is 319 (or 312, 313, and so on), that pid element would also be removed.
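To see the flaw concretely, here is a small sketch that lists every number from 300 to 319 accepted by the pattern from the question and compares it with a plain range check. It uses re.fullmatch as a stand-in for the question's findall/join comparison; that is an assumption, though the two agree for three-digit values like these.

```python
import re

# Pattern from the question: '3', then '0' or '1', then any digit except '1'.
PATTERN = re.compile(r"[3][0-1][02-9]")

# Numbers the regex accepts versus numbers that are really in 302-310.
regex_hits = [n for n in range(300, 320) if PATTERN.fullmatch(str(n))]
range_hits = [n for n in range(300, 320) if 302 <= n <= 310]

# Values the regex would delete even though they are outside the wanted range.
unwanted = sorted(set(regex_hits) - set(range_hits))
print(unwanted)  # [300, 312, 313, 314, 315, 316, 317, 318, 319]
```

The character classes constrain each digit independently, so they cannot express "between 302 and 310" as a whole.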

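Also, your code isn't removing the pid entirely. It calls remove() on the pid itself, so at best it would remove the pid's children and leave an empty pid element. (Maybe that is what you want, but it doesn't sound like it given "I would like to remove elements based on a match in their sub-elements.")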

Instead of using getroot(), try using find() to get the profile element. It is the parent of the pid elements, and we need the parent in order to remove a pid itself.
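The role of the parent can be shown with a minimal sketch. The inline XML string is a cut-down stand-in for file.xml, used here only for illustration: remove() looks at the direct children of the element it is called on, so calling it on anything other than the real parent raises ValueError.

```python
from xml.etree import ElementTree as ET

# Cut-down stand-in for file.xml (illustrative only).
entry = ET.fromstring(
    "<entry><profile><pid><pidNumber>302</pidNumber></pid></profile></entry>"
)
profile = entry.find("profile")
pid = profile.find("pid")

# remove() only searches the direct children of the element it is called on.
try:
    entry.remove(pid)   # pid is a grandchild of entry, not a child
    raised = False
except ValueError:
    raised = True

profile.remove(pid)     # profile is the actual parent, so this succeeds
remaining = profile.findall("pid")
```

After this runs, raised is True and remaining is an empty list.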

And instead of using a regex to match pidNumber, just do a basic numeric comparison.

Example...

file.xml (with extra pid elements added for testing)

<entry>
    <title>TEST1</title>
    <profile>
        <title>Default</title>
        <pid>
            <pidNumber>1880</pidNumber>
            <ContentType>PMT</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>201</pidNumber>
            <ContentType>Video</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>301</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>302</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>303</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>309</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>310</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>319</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
    </profile>
</entry>

Python

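from xml.etree import ElementTree as ET

tree = ET.parse("file.xml")
# profile is the direct parent of every pid, so remove() must be called on it.
profile = tree.find("profile")

# findall() returns a list, so removing children while looping over it is safe.
for pid in profile.findall(".//pid"):
    nbr = int(pid.find("pidNumber").text)
    if 302 <= nbr <= 310:
        profile.remove(pid)

tree.write('out.xml')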

out.xml

<entry>
    <title>TEST1</title>
    <profile>
        <title>Default</title>
        <pid>
            <pidNumber>1880</pidNumber>
            <ContentType>PMT</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>201</pidNumber>
            <ContentType>Video</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>301</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>319</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
    </profile>
</entry>
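If the pid elements were spread over more than one parent, a single find("profile") would not be enough. The following is a sketch under that assumption: the two-profile XML string is hypothetical, and the child-to-parent dictionary is a common workaround for the fact that ElementTree elements do not record their parent.

```python
from xml.etree import ElementTree as ET

# Hypothetical input with two profile elements (not part of the question).
XML = (
    "<entry>"
    "<profile><pid><pidNumber>201</pidNumber></pid>"
    "<pid><pidNumber>302</pidNumber></pid></profile>"
    "<profile><pid><pidNumber>310</pidNumber></pid>"
    "<pid><pidNumber>319</pidNumber></pid></profile>"
    "</entry>"
)
root = ET.fromstring(XML)

# Build a child -> parent lookup, since elements have no parent pointer.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Materialize the matches first so the tree is not changed while iterating it.
for pid in list(root.iter("pid")):
    if 302 <= int(pid.findtext("pidNumber")) <= 310:
        parent_of[pid].remove(pid)

kept = [p.findtext("pidNumber") for p in root.iter("pid")]
print(kept)  # ['201', '319']
```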

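Another option is to use lxml instead of ElementTree. It gives you full XPath 1.0 support, so you can do the comparison in a predicate.

Using the file.xml input above, the following Python produces the same out.xml output as above.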

from lxml import etree

tree = etree.parse("file.xml")
for pid in tree.xpath(".//pid[pidNumber[. >= 302][310 >= .]]"):
    pid.getparent().remove(pid)

tree.write("out.xml")

A third option is to use XSLT (thanks to @Parfait for the suggestion)...

Python

from lxml import etree

tree = etree.parse("file.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
new_tree.write_output("out_xslt.xml")

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="pid[pidNumber[. >= 302][310 >= .]]"/>

</xsl:stylesheet>

Again, this produces the same result as the other options when given the same input.

Regarding python - Removing XML elements based on a match, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50078222/
