Python lxml iterparse: sorting a large XML file by attribute

Tags: python xml sorting lxml iterparse

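I have a large XML file and I am trying to sort the icons within each programme. I want the icons ordered by the value of their width attribute, in descending order. I have managed to remove some icons I don't need, but I'm not sure how to sort the rest. Any help would be appreciated.

This is the code I use to remove the icons I don't want; I'm not sure how to order the remaining ones. I use iterparse because reading the whole file takes up a lot of memory.

Current removal code: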

import lxml.etree as ET
xml_source = 'ss_sky_sw_xmltv.xml'
xml_output = 'ss_sky_sw_xmltv_parsed.xml'

context = ET.iterparse(xml_source, encoding='iso-8859-1', tag='icon')
for event, elem in context:
    if elem.getparent().tag != 'channel' :
        if elem.tag == 'icon':
            if elem.attrib['width'] == '180' and elem.attrib['height'] == '135':
                elem.getparent().remove(elem)
            elif elem.attrib['width'] == '120' and elem.attrib['height'] == '180':
                elem.getparent().remove(elem)
ET.ElementTree(context.root).write(xml_output, xml_declaration=True)

XML file:

<tv source-info-name="Schedules Direct" generator-info-name="mc2xml" generator-info-url="mailto:mc2xml@gmail.com">
    <channel id="I963.24337.schedulesdirect.org">
        <display-name>963 BBC1SE</display-name>
        <display-name>963</display-name>
        <display-name>BBC1SE</display-name>
        <display-name>BBC One South East</display-name>
        <display-name>BBC1</display-name>
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/stationLogos/s24337_h3_aa.png" width="360" height="270" />
    </channel>
    <channel id="I964.24326.schedulesdirect.org">
        <display-name>964 BBC1STH</display-name>
        <display-name>964</display-name>
        <display-name>BBC1STH</display-name>
        <display-name>BBC One South</display-name>
        <display-name>BBC1</display-name>
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/stationLogos/s24326_h3_aa.png" width="360" height="270" />
    </channel>
    <programme start="20191007150000 +0100" stop="20191007154500 +0100" channel="I101.24327.schedulesdirect.org">
        <title lang="en">Escape to the Perfect Town</title>
        <sub-title lang="en">Knaresborough, North Yorkshire</sub-title>
        <desc lang="en">Steve Brown helps a couple feeling the pinch of the London property market to decide on their perfect town and the right property in which to raise their young children. They're amazed by what their £280,000 budget can buy them out of the capital city, and that moving to a desirable town means buzzing high streets, great community spirit and green spaces, as well as a quick commute to York for a teaching job is all on their doorstep.</desc>
        <credits>
            <producer>John Comerford</producer>
            <producer>Eleanor Brocklehurst</producer>
        </credits>
        <date>20191007</date>
        <category lang="en">House/garden</category>
        <category lang="en">Home improvement</category>
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_v6_aa.jpg" width="120" height="180" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_v2_aa.jpg" width="135" height="180" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_h5_aa.jpg" width="180" height="135" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_h14_aa.jpg" width="240" height="135" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_v5_aa.jpg" width="240" height="360" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_v3_aa.jpg" width="270" height="360" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_h3_aa.jpg" width="360" height="270" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_h13_aa.jpg" width="480" height="270" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_v7_aa.jpg" width="480" height="720" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_v4_aa.jpg" width="540" height="720" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_h6_aa.jpg" width="720" height="540" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_h12_aa.jpg" width="960" height="540" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_h11_aa.jpg" width="1280" height="720" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_v8_aa.jpg" width="960" height="1440" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_v9_aa.jpg" width="1080" height="1440" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_h9_aa.jpg" width="1440" height="1080" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p17421608_st_h10_aa.jpg" width="1920" height="1080" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p17421608_st_h2_aa.jpg" width="2048" height="1024" />
        <episode-num system="dd_progid">EP03325404.0001</episode-num>
        <episode-num system="xmltv_ns">0.0.</episode-num>
        <new />
    </programme>
    <programme start="20191007154500 +0100" stop="20191007163000 +0100" channel="I101.24327.schedulesdirect.org">
        <title lang="en">Make Me a Dealer</title>
        <sub-title lang="en">Liverpool: Sarah &amp; Marika</sub-title>
        <desc lang="en">Paul Martin teaches two antiques lovers the tricks of the trade and turns them into successful antiques dealers. In Liverpool, hairdresser Sarah faces off against civil servant Marika.</desc>
        <credits>
            <director>Gabe Crozier</director>
            <director>Dan Donnelly</director>
            <producer>Paul Tucker</producer>
            <producer>Carole Lochhead</producer>
            <producer>Jo Dunscombe</producer>
            <presenter>Paul Martin</presenter>
        </credits>
        <date>20191007</date>
        <category lang="en">How-to</category>
        <category lang="en">Collectibles</category>
        <category lang="en">Art</category>
        <category lang="en">Arts/crafts</category>
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_v6_aa.jpg" width="120" height="180" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_v2_aa.jpg" width="135" height="180" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_h5_aa.jpg" width="180" height="135" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_h14_aa.jpg" width="240" height="135" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_v5_aa.jpg" width="240" height="360" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_v3_aa.jpg" width="270" height="360" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_h3_aa.jpg" width="360" height="270" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_h13_aa.jpg" width="480" height="270" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_v7_aa.jpg" width="480" height="720" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_v4_aa.jpg" width="540" height="720" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_h6_aa.jpg" width="720" height="540" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_h12_aa.jpg" width="960" height="540" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_h11_aa.jpg" width="1280" height="720" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_v8_aa.jpg" width="960" height="1440" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_v9_aa.jpg" width="1080" height="1440" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_h9_aa.jpg" width="1440" height="1080" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_h10_aa.jpg" width="1920" height="1080" />
        <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p16172084_b_v12_aa.jpg" width="1920" height="2880" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_v13_aa.jpg" width="2160" height="2880" />
        <icon src="https://json.schedulesdirect.org/20141201/image/assets/p16172084_b_s4_aa.jpg" width="3000" height="3000" />
        <episode-num system="dd_progid">EP03082486.0021</episode-num>
        <episode-num system="xmltv_ns">1.0.</episode-num>
        <new />
    </programme>
</tv>

Best answer

import lxml.etree as ET
from copy import deepcopy

xml_source = 'ss_sky_sw_xmltv.xml'
xml_output = 'ss_sky_sw_xmltv_parsed.xml'
# icons with these dimensions (width, height) will be removed:
remove_dimensions = (
    (180, 135),
    (120, 180),
    )

tree = ET.parse(xml_source)
root = tree.getroot()
for programme in root.iterfind('programme'):
    # Copy the icons, sorted by width in descending order, so they can be reinserted in that order
    icons = deepcopy(sorted(programme.findall('icon'), key=lambda x: int(x.attrib['width']), reverse=True))
    # Remove all icons from programme
    for old_icon in programme.findall('icon'):
        programme.remove(old_icon)

    # Reinsert the items
    for new_icon in icons:
        # Build a (width, height) tuple to compare
        dimensions = int(new_icon.attrib['width']), int(new_icon.attrib['height'])
        # Skip the icon (do not reinsert it) if its dimensions are marked for removal
        if dimensions not in remove_dimensions:
            programme.append(new_icon)

# Save the file
tree.write(xml_output, xml_declaration=True, pretty_print=True)

Regarding "Python lxml iterparse: sorting a large XML file by attribute", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58512213/
