python - Converting nested XML

Tags: python xml csv

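I am trying to parse nested XML into a pandas DataFrame so that I can generate a CSV in which each column is an element name and each value is that element's text, but I am running into problems parsing the data. Below are what I have tried and a sample of the nested XML.

The XML shown below can be very large, with hundreds of different records. This is what I have tried: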

##Import modules
import xml.etree.ElementTree as ET
import pandas as pd
from lxml import etree

tree = ET.parse("File.xml")
root = tree.getroot()

for subelement in root:
    for subsub in subelement:
        print(subsub.tag,",", subsub.text, subsub.attrib, subsub.items())

for subelement in root:
    for subsub in subelement:
        for subsubsub in subsub:
            print(subsubsub.tag,",", subsubsub.text, subsubsub.attrib)

Sample XML:

<?xml version="1.0" encoding="utf-16"?>
<test1 xmlns="test.xsd">
    <test2 ID="123123123" test3="123123">
        <test3>Separate</test3>
        <test4>AA</test4>
        <Comments>BB</Comments>
        <test5>
            <test6 ID="123123">
                <test3>today</test3>
                <test7>123 street</test7>
            </test6>
        </test5>
        <test8>
            <test10 ID="434234">
                <test3>type of work</test3>
                <test9>test work</test9>
            </test10>
        </test8>
        <test11>
            <test12 ID="234234234">
                <test3>Social</test3>
                <test14>test</test14>
            </test12>
            <test12 ID="123123">
                <test3>Something Here</test3>
                <test13>Some date</test13>
                <test14>123123124433</test14>
            </test12>
        </test11>
        <test15>
            <test16 ID="6456456456">
                <test3>Something Something</test3>
                <test14>746745636</test14>
            </test16>
        </test15>
    </test2>
    <test2 ID="353453245" test3="list of something">
        <test3>Somewhere</test3>
        <test4>Someone</test4>
        <Comments>Some comment</Comments>
        <test5>
            <test6 ID="567456756">
                <test3>Not today</test3>
                <test7>5634643643</test7>
                <test17>Some Info</test17>
                <test19>Somewhere</test19>
                <test18>63243333</test18>
            </test6>
        </test5>
        <test11>
            <test12 ID="456436346">
                <test3>Pattern</test3>
                <test14>436346346</test14>
            </test12>
            <test12 ID="4364356">
                <test3> ID</test3>
                <test14>5674567457</test14>
            </test12>
            <test12 ID="123123123443">
                <test3>Other ID</test3>
                <test13>54234532452345</test13>
                <test14>231423532452345</test14>
            </test12>
        </test11>
        <test15>
            <test16 ID="34252345">
                <test3>None test</test3>
                <test14>456436436346</test14>
            </test16>
        </test15>
    </test2>
</test1>

Update: So would the complete code look like this?

###TEST USING EXAMPLE HOTLIST
import csv
import re
import xml.etree.ElementTree as ET

with open("file.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('file.xml'):
        # strip the namespace from the element tag name; e.g. {test.xsd}test14 > test14
        tag = re.sub(r"^\{.*?\}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            text = elem.text
            old = row.get(tag)
            if old is None:
                # first occurrence of the tag
                row[tag] = text
            elif isinstance(old, str):
                # second occurrence of the tag
                row[tag] = [old, text]
            else:
                # already a list
                old.append(text)

Best answer

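For nested XML, you can use the iterparse() function to iterate over every element in the document. You then need logic that decides, based on each element's tag, what to add to the dictionary that will be exported as a row.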

import xml.etree.ElementTree as ET

for _, elem in ET.iterparse('file.xml'):
    if len(elem) == 0:
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')
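
To make the iteration order concrete, here is a minimal sketch that runs iterparse() over a small in-memory document. The SAMPLE bytes and the helper names local_name and end_event_order are made up for this illustration and are not part of the original answer. With the default events, iterparse() reports each element when its closing tag is read, so children show up before their parents.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document, modeled on the sample in the question.
SAMPLE = b"""<test1 xmlns="test.xsd">
  <test2 ID="1">
    <test3>Separate</test3>
    <test5>
      <test6 ID="2">
        <test7>123 street</test7>
      </test6>
    </test5>
  </test2>
</test1>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def end_event_order(source):
    """Return the local tag names in the order iterparse reports them."""
    return [local_name(elem.tag) for _, elem in ET.iterparse(source)]


order = end_event_order(io.BytesIO(SAMPLE))
print(order)
# ['test3', 'test7', 'test6', 'test5', 'test2', 'test1']
```

Because test2 is reported only after everything nested inside it, seeing test2 is a convenient signal that one record is complete.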

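To create a CSV row from the element text, you can do something like the following. By default, iterparse() reports an element only once its closing tag has been read, so a "test2" element arrives after all of its children. If "test2" delimits a record, you can use it as the signal to write the collected row and clear the dictionary for the next record.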

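If you want to output all or some of the attributes, you will need to add a few lines of code for that. If an attribute has the same name as an element, or if several elements carry the same attribute (for example ID), your code has to resolve that conflict.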

import xml.etree.ElementTree as ET
import re
import csv

with open("out.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('test.xml'):
        # strip the namespace from the element tag name; e.g. {test.xsd}test14 > test14
        tag = re.sub(r"^\{.*?\}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            row[tag] = elem.text

Output:

{'test3': 'Something Something', 'test4': 'AA', 'Comments': 'BB', 'test7': '123 street', 'test9': 'test work', 'test14': '746745636', 'test13': 'Some date'}
{'test3': 'None test', 'test4': 'Someone', 'Comments': 'Some comment', 'test7': '5634643643', 'test17': 'Some Info', 'test19': 'Somewhere', 'test18': '63243333', 'test14': '456436436346', 'test13': '54234532452345'}

CSV output:

test3,test4,test7,test9,test13,test14,test17,test18,test19,Comments
Something Something,AA,123 street,test work,Some date,746745636,,,,BB
None test,Someone,5634643643,,54234532452345,456436436346,Some Info,63243333,Somewhere,Some comment
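
For the attribute case mentioned earlier, one possible approach is to store each attribute under a column name that combines the element name and the attribute name, so that the ID of test2 and the ID of test6 do not overwrite each other. The sketch below is only an illustration under assumptions: the "tag@attr" naming scheme, the SAMPLE bytes, and the function rows_with_attributes are invented here, and the header is derived from the collected rows rather than hard-coded. It also calls elem.clear() once a record has been handed over, which may help with very large files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document, modeled on the sample in the question.
SAMPLE = b"""<test1 xmlns="test.xsd">
  <test2 ID="123123123" test3="123123">
    <test3>Separate</test3>
    <test5>
      <test6 ID="123123">
        <test7>123 street</test7>
      </test6>
    </test5>
  </test2>
</test1>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def rows_with_attributes(source, record_tag="test2"):
    """Yield one dict per record, with attributes stored as 'tag@attr' keys."""
    row = {}
    for _, elem in ET.iterparse(source):
        tag = local_name(elem.tag)
        # Prefix each attribute with its element name (assumed naming scheme)
        # so an attribute never collides with an element of the same name.
        for name, value in elem.attrib.items():
            row[f"{tag}@{name}"] = value
        if len(elem) == 0:
            row[tag] = elem.text
        if tag == record_tag:
            yield row
            row = {}
            elem.clear()  # drop the finished record's children to save memory


rows = list(rows_with_attributes(io.BytesIO(SAMPLE)))
header = sorted({key for row in rows for key in row})

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Collecting all rows before writing is what allows the header to be computed from the data. For a file too large to hold in memory, a fixed header list like the one in the answer is the safer choice.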

Update:

If you want to handle repeated tags and build a list of values, try something like this:

if len(elem) == 0:
    text = elem.text
    old = row.get(tag)
    if old is None:
        # first occurrence
        row[tag] = text
    elif isinstance(old, str):
        # second occurrence > create list
        row[tag] = [old, text]
    else:
        old.append(text)
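
Note that csv.DictWriter writes a list value as its Python string form (for example "['a', 'b']"), which may not be what you want in the CSV. The following sketch is a variation on the snippet above, not the original answer's code: it always stores values in a list, which removes the isinstance check, and joins them with a separator before writing. The SAMPLE bytes, the "; " separator, and the names collect_rows and flatten are assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document with repeated test3/test14 tags.
SAMPLE = b"""<test1 xmlns="test.xsd">
  <test2 ID="1">
    <test3>Separate</test3>
    <test11>
      <test12 ID="2"><test3>Social</test3><test14>test</test14></test12>
      <test12 ID="3"><test3>Something Here</test3><test14>123123124433</test14></test12>
    </test11>
  </test2>
</test1>"""


def collect_rows(source, record_tag="test2"):
    """Return one dict per record; each value lists every text seen for a tag."""
    rows, row = [], {}
    for _, elem in ET.iterparse(source):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the '{namespace}' prefix
        if len(elem) == 0:
            # Always append to a list, so first and repeated occurrences
            # are handled the same way; empty elements become "".
            row.setdefault(tag, []).append(elem.text or "")
        if tag == record_tag:
            rows.append(row)
            row = {}
    return rows


def flatten(row, sep="; "):
    """Join each tag's values into a single string for one CSV cell."""
    return {tag: sep.join(values) for tag, values in row.items()}


rows = [flatten(row) for row in collect_rows(io.BytesIO(SAMPLE))]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["test3", "test14"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

If the values themselves can contain the separator, a different separator or one column per occurrence (test3_1, test3_2, and so on) would be needed instead.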

Regarding python - Converting nested XML, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70599829/
