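I have an XML document with a hierarchical, tree-like structure; see the example below.
The document contains several <Message> tags (I copied only one of them for convenience). Each <Message> has some associated data (id, status, priority). In addition, each <Message> can contain one or more <Street> children, which again have some associated data (<name>, <length>). Furthermore, each <Street> can have one or more <Link> children, which again have their own associated data (<id>, <direction>).
Example XML document: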
<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
  <Messages>
    <Message id='12345'>
      <status>Active</status>
      <priority>Low</priority>
      <Area>
        <Streets>
          <Street>
            <name>King Street</name>
            <length>Short</length>
            <Link>
              <id>75838745</id>
              <direction>North</direction>
            </Link>
            <Link>
              <id>168745</id>
              <direction>South</direction>
            </Link>
            <Link>
              <id>975416</id>
              <direction>North</direction>
            </Link>
          </Street>
          <Street>
            <name>Queen Street</name>
            <length>Long</length>
            <Link>
              <id>366248</id>
              <direction>West</direction>
            </Link>
            <Link>
              <id>745812</id>
              <direction>East</direction>
            </Link>
          </Street>
        </Streets>
      </Area>
    </Message>
  </Messages>
</Root>
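Parsing the XML with Python and storing the relevant data in variables is not the problem: I can use, for example, the lxml library and either read the whole document and then run some xpath expressions to get the relevant fields, or read it incrementally with the iterparse method.
However, I would like to put the data into a pandas dataframe while preserving its hierarchy. The goal is to query for single messages (e.g. by a boolean expression like if status == Active then get the Message with all its streets and its streets' links) and to get all the data that belongs to a specific message (its streets and its streets' links). How would this best be done?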
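I tried different things, but ran into problems with all of them.
If I create one dataframe row for each XML line that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an index with lots of NaN in it (which is generally discouraged), because MessageID does not know its child streets and links yet. In addition, I don't know how to select a sub-dataset by a boolean condition instead of only getting some single rows without their children.
When doing a GroupBy on [MessageID, StreetName, LinkID], I don't know how to get a (probably MultiIndex) dataframe back from the pandas GroupBy object, since there is nothing to aggregate here (no mean/std/sum/whatever; the values should stay the same).
Any suggestions on how this could be handled efficiently?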
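Best answer
I finally managed to solve the problem described above, and this is how.
I extended the XML document given above to include two messages instead of one. This is how it looks as a valid Python string (it could of course also be loaded from a file):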
xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<Root>
  <Messages>
    <Message id='12345'>
      <status>Active</status>
      <priority>Low</priority>
      <Area>
        <Streets>
          <Street>
            <name>King Street</name>
            <length>Short</length>
            <Link>
              <id>75838745</id>
              <direction>North</direction>
            </Link>
            <Link>
              <id>168745</id>
              <direction>South</direction>
            </Link>
            <Link>
              <id>975416</id>
              <direction>North</direction>
            </Link>
          </Street>
          <Street>
            <name>Queen Street</name>
            <length>Long</length>
            <Link>
              <id>366248</id>
              <direction>West</direction>
            </Link>
            <Link>
              <id>745812</id>
              <direction>East</direction>
            </Link>
          </Street>
        </Streets>
      </Area>
    </Message>
    <Message id='54321'>
      <status>Inactive</status>
      <priority>High</priority>
      <Area>
        <Streets>
          <Street>
            <name>Princess Street</name>
            <length>Mid</length>
            <Link>
              <id>744154</id>
              <direction>West</direction>
            </Link>
            <Link>
              <id>632214</id>
              <direction>South</direction>
            </Link>
            <Link>
              <id>654785</id>
              <direction>East</direction>
            </Link>
          </Street>
          <Street>
            <name>Prince Street</name>
            <length>Very Long</length>
            <Link>
              <id>1022444</id>
              <direction>North</direction>
            </Link>
            <Link>
              <id>4474558</id>
              <direction>South</direction>
            </Link>
          </Street>
        </Streets>
      </Area>
    </Message>
  </Messages>
</Root>'''
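Note that this string omits the xmlns="someNamespace" declaration that the question's document carries on <Root>. If the namespace is kept, ElementTree reports tags in the form {someNamespace}Message, so a plain comparison such as element.tag == 'Message' would no longer match. The following is a minimal sketch of one way to cope with that; local_name is a hypothetical helper introduced here for illustration and is not part of the original answer.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Shortened version of the question's document, namespace included.
namespaced = b"""<Root xmlns="someNamespace">
  <Messages>
    <Message id="12345"><status>Active</status></Message>
  </Messages>
</Root>"""


def local_name(tag):
    """Hypothetical helper: drop a leading '{namespace}' prefix from a tag."""
    return tag.rsplit('}', 1)[-1]


tags = []
for event, element in ET.iterparse(BytesIO(namespaced), events=('end',)):
    # element.tag is the qualified name, e.g. '{someNamespace}status'
    tags.append((element.tag, local_name(element.tag)))

for qualified, local in tags:
    print(qualified, '->', local)
```

Comparing against local_name(element.tag) instead of element.tag would let the same tag checks work with or without the namespace declaration.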
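To parse the hierarchical XML structure into a flat pandas dataframe, I used the iterparse method of Python's ElementTree, which provides a SAX-like interface for walking through an XML document incrementally and fires events when specific XML tags start or end.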
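To make the event model concrete, here is a small sketch (not part of the original answer) that records the events iterparse produces for a single street element. The variable names snippet and events are chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

snippet = b"<Street><name>King Street</name><length>Short</length></Street>"

events = []
for event, element in ET.iterparse(BytesIO(snippet), events=('start', 'end')):
    # A 'start' event fires when the opening tag is seen, an 'end' event
    # when the closing tag is seen. The element's text is only guaranteed
    # to be complete at the 'end' event.
    events.append((event, element.tag))
    print(event, element.tag)
```

Each child element opens and closes before its parent's 'end' event arrives, which is why attributes can be read at 'start' while text content is better read at 'end'.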
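For each parsed XML element, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that belongs together (message, street, link) and that will later be stored in its own dataframe row. When all information for one such row has been collected, the dictionary is appended to a list that stores all rows in the appropriate order.
This is what the XML parsing looks like (see the inline comments for further explanation):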
# imports
import xml.etree.ElementTree as ET
import pandas as pd
# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))
# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}
# initialize list that stores the single dataframe rows
listOfRows = []
# read the xml document incrementally and react when specific tags start or end
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):
    ##########
    # get all information on the current message and store in the appropriate dictionary
    ##########
    # get current message's id attribute
    if event == 'start' and element.tag == 'Message':
        messageRow = {}  # re-initialize the dictionary for the current row
        messageRow['messageId'] = element.get('id')
    # get current message's status
    if event == 'end' and element.tag == 'status':
        messageRow['status'] = element.text
    # get current message's priority
    if event == 'end' and element.tag == 'priority':
        messageRow['priority'] = element.text
    # when no more information on the current message is expected, append it to the list of rows
    if event == 'end' and element.tag == 'priority':
        listOfRows.append(messageRow)
    ##########
    # get all information on the current street and store in row dictionary
    ##########
    if event == 'end' and element.tag == 'name':
        streetRow = {}  # re-initialize the dictionary for the current street row
        streetRow['streetName'] = element.text
    if event == 'end' and element.tag == 'length':
        streetRow['streetLength'] = element.text
    # when no more information on the current street is expected, append it to the list of rows
    if event == 'end' and element.tag == 'length':
        # link the street to the message it belongs to, then append
        streetRow['messageId'] = messageRow['messageId']
        listOfRows.append(streetRow)
    ##########
    # get all information on the current link and store in row dictionary
    ##########
    if event == 'end' and element.tag == 'id':
        linkRow = {}  # re-initialize the dictionary for the current link row
        linkRow['linkId'] = element.text
    if event == 'end' and element.tag == 'direction':
        linkRow['direction'] = element.text
    # when no more information on the current link is expected, append it to the list of rows
    if event == 'end' and element.tag == 'direction':
        # link the link to the message it belongs to, then append
        linkRow['messageId'] = messageRow['messageId']
        listOfRows.append(linkRow)
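listOfRows is now a list of dictionaries, where each dictionary stores the information that is to be put into one dataframe row. A dataframe can be created using this list as the data source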
# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)
which gives the "raw" dataframe:
    messageId    status  priority       streetName  streetLength    linkId  \
 0      12345    Active       Low              NaN           NaN       NaN
 1      12345       NaN       NaN      King Street         Short       NaN
 2      12345       NaN       NaN              NaN           NaN  75838745
 3      12345       NaN       NaN              NaN           NaN    168745
 4      12345       NaN       NaN              NaN           NaN    975416
 5      12345       NaN       NaN     Queen Street          Long       NaN
 6      12345       NaN       NaN              NaN           NaN    366248
 7      12345       NaN       NaN              NaN           NaN    745812
 8      54321  Inactive      High              NaN           NaN       NaN
 9      54321       NaN       NaN  Princess Street           Mid       NaN
10      54321       NaN       NaN              NaN           NaN    744154
11      54321       NaN       NaN              NaN           NaN    632214
12      54321       NaN       NaN              NaN           NaN    654785
13      54321       NaN       NaN    Prince Street     Very Long       NaN
14      54321       NaN       NaN              NaN           NaN   1022444
15      54321       NaN       NaN              NaN           NaN   4474558

    direction
 0        NaN
 1        NaN
 2      North
 3      South
 4      North
 5        NaN
 6       West
 7       East
 8        NaN
 9        NaN
10       West
11      South
12       East
13        NaN
14      North
15      South
We can now set the columns of interest (messageId, streetName, linkId) as a MultiIndex on that dataframe:
# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)
which gives:
                                     status priority streetLength direction
messageId streetName      linkId
12345     NaN             NaN        Active      Low          NaN       NaN
          King Street     NaN           NaN      NaN        Short       NaN
          NaN             75838745      NaN      NaN          NaN     North
                          168745        NaN      NaN          NaN     South
                          975416        NaN      NaN          NaN     North
          Queen Street    NaN           NaN      NaN         Long       NaN
          NaN             366248        NaN      NaN          NaN      West
                          745812        NaN      NaN          NaN      East
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South
Although NaNs in an index should generally be avoided, I don't see any problem with them for this use case.
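If the NaN entries in the index do turn out to be a problem, one possible alternative (a sketch, not part of the original answer) is to forward-fill each parent's values down to its children within a message and keep only the link rows. This relies on the row order produced by the parser, where a message row precedes its streets and a street row precedes its links. The rows list below is a shortened stand-in for listOfRows, and parent_cols and links are names chosen for illustration.

```python
import pandas as pd

# Shortened stand-in for the answer's listOfRows, in parser order.
rows = [
    {'messageId': '12345', 'status': 'Active', 'priority': 'Low'},
    {'messageId': '12345', 'streetName': 'King Street', 'streetLength': 'Short'},
    {'messageId': '12345', 'linkId': '75838745', 'direction': 'North'},
    {'messageId': '12345', 'linkId': '168745', 'direction': 'South'},
    {'messageId': '12345', 'streetName': 'Queen Street', 'streetLength': 'Long'},
    {'messageId': '12345', 'linkId': '366248', 'direction': 'West'},
    {'messageId': '54321', 'status': 'Inactive', 'priority': 'High'},
    {'messageId': '54321', 'streetName': 'Princess Street', 'streetLength': 'Mid'},
    {'messageId': '54321', 'linkId': '744154', 'direction': 'West'},
]
columns = ['messageId', 'status', 'priority', 'streetName',
           'streetLength', 'linkId', 'direction']
raw = pd.DataFrame(rows, columns=columns)

# Carry each parent's values down to the rows beneath it, without ever
# crossing a message boundary.
parent_cols = ['status', 'priority', 'streetName', 'streetLength']
raw[parent_cols] = raw.groupby('messageId')[parent_cols].ffill()

# Only link rows have a linkId; after the fill they carry their parents' data,
# so every index level and every column is populated.
links = raw.dropna(subset=['linkId']).set_index(
    ['messageId', 'streetName', 'linkId'])
print(links)
```

The trade-off is that a message without streets, or a street without links, would disappear from the result, so this only fits data where every parent has at least one link.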
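Finally, to get the desired effect of accessing individual messages by their messageId, including all of their "child" streets and links, the MultiIndexed dataframe has to be grouped by the outermost index level: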
# group by the most outer index
groups = df.groupby(level='messageId')
Now you can either loop over all messages (and do whatever you like with them) using
# iterate over all groups
for key, group in groups:
    print('key: ' + key)
    print('group:')
    print(group)
    print('\n')
which returns
key: 12345
group:
                                     status priority streetLength direction
messageId streetName      linkId
12345     NaN             NaN        Active      Low          NaN       NaN
          King Street     NaN           NaN      NaN        Short       NaN
          NaN             75838745      NaN      NaN          NaN     North
                          168745        NaN      NaN          NaN     South
                          975416        NaN      NaN          NaN     North
          Queen Street    NaN           NaN      NaN         Long       NaN
          NaN             366248        NaN      NaN          NaN      West
                          745812        NaN      NaN          NaN      East
key: 54321
group:
                                     status priority streetLength direction
messageId streetName      linkId
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South
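Or you can access a specific message by its messageId, which returns the row containing the messageId together with all of its dedicated streets and links: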
# get groups by key
print('specific group only:')
print(groups.get_group('54321'))
giving
specific group only:
                                     status priority streetLength direction
messageId streetName      linkId
54321     NaN             NaN      Inactive     High          NaN       NaN
          Princess Street NaN           NaN      NaN          Mid       NaN
          NaN             744154        NaN      NaN          NaN      West
                          632214        NaN      NaN          NaN     South
                          654785        NaN      NaN          NaN      East
          Prince Street   NaN           NaN      NaN    Very Long       NaN
          NaN             1022444       NaN      NaN          NaN     North
                          4474558       NaN      NaN          NaN     South
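The question also asked how to select messages by a boolean condition such as status == Active while keeping their streets and links. The following is a minimal sketch of one way to do that on the MultiIndexed dataframe; it is not part of the original answer, and the rows list is a shortened stand-in for listOfRows. The idea is to find the messageId values whose message-level row satisfies the condition and then keep every row that carries one of those ids.

```python
import pandas as pd

# Shortened stand-in for the answer's listOfRows.
rows = [
    {'messageId': '12345', 'status': 'Active', 'priority': 'Low'},
    {'messageId': '12345', 'streetName': 'King Street', 'streetLength': 'Short'},
    {'messageId': '12345', 'linkId': '75838745', 'direction': 'North'},
    {'messageId': '54321', 'status': 'Inactive', 'priority': 'High'},
    {'messageId': '54321', 'streetName': 'Princess Street', 'streetLength': 'Mid'},
    {'messageId': '54321', 'linkId': '744154', 'direction': 'West'},
]
columns = ['messageId', 'status', 'priority', 'streetName',
           'streetLength', 'linkId', 'direction']
df = pd.DataFrame(rows, columns=columns)
df = df.set_index(['messageId', 'streetName', 'linkId'])

# Only message-level rows carry a status, so the comparison is True for
# exactly one row per matching message (a missing status compares as False).
message_ids = df.index.get_level_values('messageId')
active_ids = message_ids[(df['status'] == 'Active').to_numpy()].unique()

# Keep every row (message, streets, links) whose messageId is in that set.
active = df[message_ids.isin(active_ids)]
print(active)
```

Any other condition on the message-level columns (for example on priority) could be substituted for the status comparison in the same way.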
Hope this is helpful for someone.
Regarding "python - Read hierarchical (tree-like) XML into a pandas dataframe, preserving hierarchy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27503851/