python - How do I reference the parent element with lxml in Python and remove a parent element from RSS XML?

I haven't been able to crack this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>

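My goal is to check whether the second description tag contains certain strings. If it does, I want to remove the item completely. Currently my code looks like this: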

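import lxml.etree

doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')

for desc in found:
    # desc.text is None for an empty <description/>, so fall back to ""
    if "FORBIDDENSTRING" in (desc.text or ""):
        # removes only the matching <description>, not its <item>
        desc.getparent().remove(desc)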

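This removes only the second description tag, which makes sense, but I want the whole item gone. If all I have is the "desc" reference, I don't know how to get hold of the "item" element.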

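I've tried Googling and searching here, but the cases I found only wanted to remove the tag itself, as I'm doing now; oddly, I haven't come across example code that removes the whole parent object. Any pointers to documentation, tutorials, or help are very welcome.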

Best answer

I'm a big fan of XSLT, but another option is to select the item instead of the description (select the element you want to delete, not its child).

Also, if you use xpath(), you can put the check for the forbidden string directly in the XPath predicate.
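
As a minimal sketch of that predicate idea in isolation: lxml's xpath() accepts keyword arguments as XPath variables, so the forbidden string can be passed in as $needle rather than formatted into the expression, which avoids quoting problems when the string itself contains quotes. The two-item feed and the names feed, needle, and hits are made up for illustration.

```python
from lxml import etree

# Hypothetical two-item feed, for illustration only.
feed = """<rss version="2.0"><channel>
<item><title>keep</title><description>harmless text</description></item>
<item><title>drop</title><description>has "FORBIDDENSTRING" inside</description></item>
</channel></rss>"""

doc = etree.fromstring(feed)

# $needle is an XPath variable; lxml binds it from the keyword argument,
# so the double quotes inside the string need no escaping.
needle = 'has "FORBIDDENSTRING"'
hits = doc.xpath('channel/item[description[contains(., $needle)]]', needle=needle)

# Each hit is an <item>; remove it from its parent <channel>.
for item in hits:
    item.getparent().remove(item)

remaining_titles = [t.text for t in doc.iter("title")]
print(remaining_titles)
```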

例子...

from lxml import etree

testString = """
<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>
"""

forbidden_string = "I want to get rid of the whole item"

parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))

for item in found:
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode", pretty_print=True))

This prints...

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description/>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>...</item>
    </channel>
</rss>
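
If you would rather keep the findall() loop from the question, the item is simply the parent of the matched description: one getparent() call gives the item, and a second gives the channel to remove it from. A minimal sketch, assuming a made-up feed and the placeholder FORBIDDENSTRING from the question; the `(desc.text or "")` guard is an addition to handle empty description elements, whose text is None.

```python
from lxml import etree

# Hypothetical feed, for illustration only.
feed = """<rss version="2.0"><channel>
<item><title>keep</title><description/></item>
<item><title>drop</title><description>ok</description><description>FORBIDDENSTRING here</description></item>
</channel></rss>"""

doc = etree.fromstring(feed)

# findall() returns a list, so removing elements while looping is safe.
for desc in doc.findall("channel/item/description"):
    item = desc.getparent()        # the enclosing <item>
    channel = item.getparent()     # None if this item was already removed
    if channel is not None and "FORBIDDENSTRING" in (desc.text or ""):
        channel.remove(item)

titles = [t.text for t in doc.iter("title")]
print(titles)
```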

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/50396492/
