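Hi, I can convert an XML file to a pandas DataFrame. The challenge is that the records do not end up in the correct rows. Suppose a set of tags in the XML is repeated 4 times and has multiple child nodes that should become the columns of my DataFrame. When I read the XML I want exactly 4 rows in the DataFrame, but I get far too many rows filled with NaN, because all the other tags sit at different levels.
Edit: I found some discrepancies in my description of the XML data. The XML below is the corrected, final content.
The same <ns1:parenttag> block is repeated multiple times within an XML file: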
<?xml version="1.0" encoding="UTF-8"?>
<row:user-agents xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:row="http://www.row.com"
xmlns:ns1="http://www.ns1.com"
xmlns:ns2="http://www.ns2.com"
xmlns:ns3="http://www.ns3.com"
xmlns:row1="http://www.row1.com"
xmlns:row3="http://www.row3.com"
xmlns:row2="http://www.row2.com"
xsi:schemaLocation="http://www.schemaLocation-1.4.xsd">
<row:agent1>
<row:test>
<row2:test1>
<row2:test2>
<row2:test3>9999</row2:test3>
<row2:test4>aa</row2:test4>
<row2:test5>1</row2:test5>
</row2:test2>
</row2:test1>
<row2:test6>2017</row2:test6>
</row:test>
<row:agent2>
<row3:agent3>
<ns1:parenttag>
<ns1:childtag1>
<ns1:subchildtag1>
<ns1:indenticaltag>123</ns1:indenticaltag>
</ns1:subchildtag1>
</ns1:childtag1>
<ns1:indenticaltag>456</ns1:indenticaltag>
<ns1:childtag2>N</ns1:childtag2>
<ns1:childtag3>0</ns1:childtag3>
<ns1:childtag4>N</ns1:childtag4>
<ns1:childtag5>
<ns2:subchildtag2 attributname="abc">
<ns2:sub_subchildtag1>12 45</ns2:sub_subchildtag1>
</ns2:subchildtag2>
</ns1:childtag5>
<ns1:childtag6>tyu</ns1:childtag6>
<ns1:childtag7>2</ns1:childtag7>
<ns1:childtag8> poiu</ns1:childtag8>
<ns1:childtag9>
<ns3:subchildtag3>345</ns3:subchildtag3>
<ns3:subchildtag6>567</ns3:subchildtag6>
</ns1:childtag9>
<ns1:childtag10>N</ns1:childtag10>
<ns1:childtag11>
<ns3:subchildtag4>34</ns3:subchildtag4>
<ns3:subchildtag5>abc/123</ns3:subchildtag5>
</ns1:childtag11>
<ns1:childtag12>
<ns1:indenticaltag>234</ns1:indenticaltag>
</ns1:childtag12>
</ns1:parenttag>
</row3:agent3>
</row:agent2>
</row:agent1>
</row:user-agents>
Another XML file differs slightly in the parent tag:
<ns1:parenttag>
<ns1:indenticaltag>123</ns1:indenticaltag>
<ns1:childtag2>N</ns1:childtag2>
<ns1:childtag3>0</ns1:childtag3>
<ns1:childtag4>N</ns1:childtag4>
<ns1:childtag5>
<ns2:subchildtag1 attributename0="poi">
<ns2:sub_subchildtag1>
<ns2:sub_sub_subchildtag1>
<ns2:sub_sub_sub_subchildtag1 attributename1="3" attributename2="17">1234</ns2:sub_sub_sub_subchildtag1>
</ns2:sub_sub_subchildtag1>
</ns2:sub_subchildtag1>
</ns2:subchildtag1>
</ns1:childtag5>
<ns1:childtag6>12</ns1:childtag6>
<ns1:childtag7> qwer</ns1:childtag7>
<ns1:childtag8>
<ns3:subchildtag2>456</ns3:subchildtag2>
</ns1:childtag8>
<ns1:childtag9>N</ns1:childtag9>
<ns1:childtag10>
<ns3:subchildtag3>908</ns3:subchildtag3>
<ns3:subchildtag4>abc/123</ns3:subchildtag4>
</ns1:childtag10>
</ns1:parenttag>
I am now using the function suggested by Parfait in the answer below, but I get this error:
ValueError: Length mismatch: Expected axis has 21 elements, new values have 22 elements
It also has a problem with the indenticaltag column: the same tag name appears three times, each at a different level of the hierarchy,
but the DataFrame contains only one indenticaltag column instead of three, for example:
parent.child.indenticaltag, parent.child.subchild.indenticaltag and parent.child.subchild.sub_subchild.indenticaltag etc.
The output DataFrame is:
I want to parse both XML variants using a single function.
I would like to parse all the tags and their attributes as column names in
pandas. The column names should take the form
parent.child.subchild.sub_sub_subchildtag for tags, and
parent.child.subchild.sub_sub_childtag.attribute for attributes.
Is there a better way to parse the XML and get the records in the correct format? Or am I missing something?
Edit: the solution works, but it adds some complexity.
I need help with three points, if anyone can suggest some pointers:
1) I need the pandas DataFrame column names in the form root.child.subchild.grandchild. I am not sure how to get that here, although I was able to get it in my own solution.
2) The descendant function is very slow. Is there any way to speed it up?
3) I have multiple XML files of the same type in one directory and would like to generate a single DataFrame by appending all the XML results. What is the best way to do this?
Best answer
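Consider lxml's xpath() on the <xs:top_col> nodes and use lxml's parse() to read directly from the file. The XPath loop iteratively appends to list and dictionary containers, which are then cast to a DataFrame. Also, your desired output is actually not aligned with the node values: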
import re

import pandas as pd
from lxml import etree

pd.set_option('display.width', 1000)

# Prefix-to-URI map used by every XPath query below
NSMAP = {'row': 'http://www.row.com',
         'row3': 'http://www.row3.com',
         'row1': 'http://www.row1.com',
         'xs': 'http://www.xs.com',
         'row2': 'http://www.row2.com'}

xmldata = etree.parse('RowAgent.xml')

data = []
inner = {}
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:                      # PARSE CHILDREN
        inner[i.tag] = i.text
        if len(i) > 0:                # PARSE GRANDCHILDREN (len() counts child elements)
            for subi in i:
                inner[subi.tag] = subi.text
    data.append(inner)                # ONE DICT (ROW) PER top_col
    inner = {}

df = pd.DataFrame(data)

# REGEX TO REMOVE NAMESPACE URIs IN COL NAMES
df.columns = [re.sub(r'{.*}', '', col) for col in df.columns]
To parse child elements of unlimited depth, use XPath's descendant::* axis:
data = []
inner = {}
num_top_cols = len(xmldata.xpath('//xs:top_col', namespaces=NSMAP))
for i in range(1, num_top_cols + 1):
    # Parentheses make [i] index the whole node set, not the position among siblings
    for el in xmldata.xpath('(//xs:top_col)[{}]/descendant::*'.format(i), namespaces=NSMAP):
        if el.text and el.text.strip():   # SKIP EMPTY TEXT TAGS (el.text can be None)
            inner[el.tag] = el.text.strip()
    data.append(inner)
    inner = {}

df = pd.DataFrame(data)
Output
print(df)
#    col11_1       col11_2   col8_1  col8_2      col1      col10  col12  col13_1  col2  col3  col4  col5  col6  col7  col9
# 0     2010  AB 20/SEC001     2010    2016  00032000  test_name    pqr   000330     N     0     3     N     I    AA     N
# 1  2016026    rty-qwe-01     2000   26000     03985      temp2  perrl  0117203     N     0     3     N     a   9AA     N
# 2     8965  147A-254-044     7896     NaN     00985       mjkl  rtyyu    45612     N     0     3     N   NaN  yuio     N
# 3    52369   ui 247/mh45  145ghg7     NaN     78965     ghyuio  trwer     9874     N     0     5     N   NaN  23rt     N
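On the speed question, one option worth trying is to walk each block's descendants relative to the block element itself, so the document is traversed once instead of being re-queried by index for every row. The sketch below is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, the sample XML is a hypothetical reconstruction guessed from the column names in the output above, and the helper names local_name and rows_from_blocks are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: structure guessed from the output column names above
XML = """<root xmlns:xs="http://www.xs.com">
  <xs:top_col>
    <xs:col1>00032000</xs:col1>
    <xs:col8>
      <xs:col8_1>2010</xs:col8_1>
      <xs:col8_2>2016</xs:col8_2>
    </xs:col8>
  </xs:top_col>
  <xs:top_col>
    <xs:col1>03985</xs:col1>
    <xs:col8>
      <xs:col8_1>2000</xs:col8_1>
    </xs:col8>
  </xs:top_col>
</root>"""

NS = "{http://www.xs.com}"


def local_name(tag):
    """Drop the '{uri}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def rows_from_blocks(root, block_tag):
    """Build one dict per repeated block from the text of its descendants."""
    rows = []
    for block in root.iter(block_tag):       # single pass over the tree
        row = {}
        for node in block.iter():            # the block itself plus all descendants
            if node is block:
                continue
            text = (node.text or "").strip()
            if text:                         # skip container tags with no text
                row[local_name(node.tag)] = text
        rows.append(row)
    return rows


rows = rows_from_blocks(ET.fromstring(XML), NS + "top_col")
print(rows)
```

Each dict in rows corresponds to one repeated block, so passing the list to pd.DataFrame should give one row per block, with NaN wherever a block lacks a tag. Whether this is actually faster on a given file is something to measure rather than assume.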
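Because of the performance challenges of descendant::*, consider a recursive call that first walks all the descendants, followed by a second recursive call that captures the parent/child/grandchild names for the DataFrame columns. Be sure to use an OrderedDict this time: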
from collections import OrderedDict

#... same as above XML setup ... #

def recursiveParse(curr_elem, curr_inner):
    # Walk every descendant depth-first, storing text and attributes
    for child_elem in curr_elem:
        curr_inner[child_elem.tag] = child_elem.text
        for attrib in child_elem.attrib:
            curr_inner[attrib] = child_elem.attrib[attrib]
        recursiveParse(child_elem, curr_inner)
    return curr_inner

data = []
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    inner = OrderedDict()             # keeps keys in document order
    for i in el:
        inner[i.tag] = i.text
        for attrib in i.attrib:
            inner[attrib] = i.attrib[attrib]
        recursiveParse(i, inner)
    data.append(inner)

df = pd.DataFrame(data)
colnames = []

def recursiveNames(curr_elem, curr_inner, num):
    parent_name = curr_inner[num - 1]        # dotted path of curr_elem
    for child_elem in curr_elem:
        tmp = re.sub(r'{.*}', '', child_elem.tag)
        child_name = parent_name + '.' + tmp
        curr_inner.append(child_name)
        child_pos = len(curr_inner)          # 1-based position of child_name
        for attrib in child_elem.attrib:
            curr_inner.append(child_name + '.' + attrib)
        recursiveNames(child_elem, curr_inner, child_pos)
    return curr_inner

# Names come from the first top_col only, so every block must share its structure
for el in xmldata.xpath('//xs:top_col[1]', namespaces=NSMAP):
    for i in el:
        tmp = re.sub(r'{.*}', '', i.tag)
        colnames.append(tmp)
        pos = len(colnames)
        for attrib in i.attrib:
            colnames.append(tmp + '.' + attrib)
        recursiveNames(i, colnames, pos)

df.columns = colnames
Output
print(df)
#        col1  col2  col3  col4  col5  col6  col7  col8  col8.col8_1  col8.col8_1.sName  col8.col8_2  col9      col10  col11  col11.col11_1  col11.col11_2  col12  col13  col13.col13_1
# 0  00032000     N     0     3     N     I    AA    \n         2010              pqrst         2016     N  test_name     \n           2010   AB 20/SEC001    pqr     \n         000330
# 1     03985     N     0     3     N     a   9AA    \n         2000                NaN        26000     N      temp2     \n        2016026     rty-qwe-01  perrl     \n        0117203
# 2     00985     N     0     3     N   NaN  yuio    \n         7896                NaN          NaN     N       mjkl     \n           8965   147A-254-044  rtyyu     \n          45612
# 3     78965     N     0     5     N   NaN  23rt    \n      145ghg7                NaN          NaN     N     ghyuio     \n          52369    ui 247/mh45  trwer     \n           9874
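A likely cause of the "Length mismatch" error reported in the question is that the data dictionary is keyed by bare tag name, so a tag such as indenticaltag that occurs at several depths collapses into one key while the name list still counts every occurrence. A possible way around this, sketched below under assumptions, is to build the dotted path while collecting the values, so data and column names come from the same pass. The sketch uses the standard library's xml.etree.ElementTree instead of lxml to stay dependency-free, the XML is a trimmed copy of the question's <ns1:parenttag> block plus a hypothetical second block (the value 789 is invented), and flatten and local_name are illustrative helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's block; the second block is hypothetical
XML = """<root xmlns:ns1="http://www.ns1.com" xmlns:ns2="http://www.ns2.com">
  <ns1:parenttag>
    <ns1:childtag1>
      <ns1:subchildtag1>
        <ns1:indenticaltag>123</ns1:indenticaltag>
      </ns1:subchildtag1>
    </ns1:childtag1>
    <ns1:indenticaltag>456</ns1:indenticaltag>
    <ns1:childtag5>
      <ns2:subchildtag2 attributname="abc">
        <ns2:sub_subchildtag1>12 45</ns2:sub_subchildtag1>
      </ns2:subchildtag2>
    </ns1:childtag5>
  </ns1:parenttag>
  <ns1:parenttag>
    <ns1:indenticaltag>789</ns1:indenticaltag>
  </ns1:parenttag>
</root>"""


def local_name(tag):
    """Drop the '{uri}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, path, row):
    """Fill row with dotted-path keys for the text and attributes under elem."""
    for child in elem:
        child_path = path + "." + local_name(child.tag)
        for name, value in child.attrib.items():
            row[child_path + "." + name] = value      # attribute column
        text = (child.text or "").strip()
        if text:                                      # skip pure container tags
            row[child_path] = text
        flatten(child, child_path, row)
    return row


root = ET.fromstring(XML)
rows = [flatten(block, "parenttag", {})
        for block in root.iter("{http://www.ns1.com}parenttag")]

for key, value in rows[0].items():
    print(key, "=", value)
```

Because every key carries its full path, the two indenticaltag values in the first block land in separate columns, and pd.DataFrame(rows) should align blocks of differing shape by filling the gaps with NaN. One limitation to keep in mind: siblings that repeat under the same parent still share a path, so a later value would overwrite an earlier one unless an index is added to the key.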
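Finally, integrate this processing and the original XML parsing into a loop that iterates over all the XML files in a directory. Be sure to collect every DataFrame in a list of DataFrames and then append/stack them with pd.concat().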
import os
# ... plus the modules imported above ...

dfList = []
xml_dir = '/path/to/XML/files'
for f in os.listdir(xml_dir):
    #...xml parse... (passing os.path.join(xml_dir, f) as the file name in parse())
    #...dataframe build with recursive calls...
    dfList.append(df)

finaldf = pd.concat(dfList)
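As a variation on the loop above, the rows from every file can be gathered into one list first, leaving a single DataFrame to build at the end instead of one per file. This is an alternative to pd.concat, not the approach the answer describes, and the sketch rests on assumptions: it parses with the standard library's xml.etree.ElementTree, reads only the direct children of each block, and writes two throwaway demo files with invented content. The names rows_from_file, rows_from_dir, and the source_file key are hypothetical.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def rows_from_file(path, block_tag):
    """Return one dict per block_tag element; keys are direct-child tag names."""
    rows = []
    for block in ET.parse(path).getroot().iter(block_tag):
        row = {"source_file": os.path.basename(path)}   # remember the origin
        for child in block:
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


def rows_from_dir(directory, block_tag):
    """Collect rows from every *.xml file in directory, in file-name order."""
    all_rows = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        all_rows.extend(rows_from_file(path, block_tag))
    return all_rows


# Demo with two throwaway XML files and one non-XML file that is skipped
with tempfile.TemporaryDirectory() as tmp:
    for name, value in [("a.xml", "1"), ("b.xml", "2")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write("<root><rec><col1>%s</col1></rec></root>" % value)
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as fh:
        fh.write("ignored")
    all_rows = rows_from_dir(tmp, "rec")

print(all_rows)
```

Filtering on the *.xml pattern keeps stray files in the directory out of the parser, and the source_file key makes it possible to trace each row back to its file once the list is handed to pd.DataFrame.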
Regarding "python - XML parsing in Python pandas to get a complete tag block in one row", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45461399/