I am having trouble parsing an XML file for conversion into a pandas data frame. A sample entry looks like this:
<p>
<persName id="t17200427-2-defend31" type="defendantName">
Alice
Jones
<interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
<interp inst="t17200427-2-defend31" type="given" value="Alice"/>
<interp inst="t17200427-2-defend31" type="gender" value="female"/>
</persName>
, of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName>
<interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
<interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
<join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
<interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
<interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
privately stealing a Bermundas Hat, value 10 s. out of the Shop of
<persName id="t17200427-2-victim33" type="victimName">
Edward
Hillior
<interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
<interp inst="t17200427-2-victim33" type="given" value="Edward"/>
<interp inst="t17200427-2-victim33" type="gender" value="male"/>
<join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
</persName>
</rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs>
<join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
<interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
<interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
Guilty to the value of 10 d.
</rs>
<rs id="t17200427-2-punish11" type="punishmentDescription">
<interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
<join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
Transportation
</rs> .</p>
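I want a data frame with columns for gender, offence, and the trial text. I previously extracted all of the data into a data frame, but I could not get the text between the tags.
Here is the sample code: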
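import xml.etree.ElementTree as ET
import pandas as pd

def table_of_cases(xml_file_name):
    file = ET.ElementTree(file=xml_file_name)
    i = 1
    num = str(i)   # suffix appended to repeated labels
    labels = []    # column names seen so far
    table = pd.DataFrame()
    # iter() walks every element in document order
    for element in file.iter():
        if element.tag == "persName":
            t = element.attrib['type']
            try:
                # persName carries no 'value' attribute in the sample above,
                # so this lookup raises KeyError and the element is skipped
                val = [element.attrib['value']]
                if t not in labels:
                    table[t] = val
                elif t+num not in labels:
                    table[t+num] = val
                elif t+num in labels:
                    num = str(i+1)
                    table[t+num] = val
            except Exception:
                pass
            labels = list(table.columns.values)
            num = str(i)
    return table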
** I have roughly 1,000 or more files in this same XML format to combine into one data frame.
Best answer
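Because your XML is fairly complex, with text values spilling across nodes, consider XSLT, a special-purpose language designed to transform XML files, especially complex ones, into simpler ones.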
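Python's third-party module, lxml, can run XSLT 1.0 and can also run XPath 1.0 to parse the transformed result for migration into a pandas data frame. Alternatively, you can use external XSLT processors, which Python can call via subprocess.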
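As a rough sketch of the subprocess route, the snippet below shells out to xsltproc, the command-line tool that ships with libxslt. It assumes xsltproc is installed and on PATH; the helper names build_xsltproc_command and run_xslt and the file names are made up for illustration.

```python
import shutil
import subprocess


def build_xsltproc_command(xsl_path, xml_path, out_path):
    """Return the argument list for: xsltproc -o OUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]


def run_xslt(xsl_path, xml_path, out_path):
    """Run the external processor; raise if it is missing or fails."""
    if shutil.which("xsltproc") is None:
        raise FileNotFoundError("xsltproc was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_xsltproc_command(xsl_path, xml_path, out_path), check=True)
    return out_path


if __name__ == "__main__":
    # Only prints the command; call run_xslt(...) to actually execute it.
    print(build_xsltproc_command("XSLT_Script.xsl", "Source.xml", "Output.xml"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.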
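Specifically, the XSLT below uses XPath's descendant::* axis to extract the needed attributes for the defendant and the victim, along with the text value of the whole paragraph. It starts at the root and assumes <p> is a child of it.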
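To get a feel for what a descendant search and an element's whole text value return, here is a small illustration using only the standard library's ElementTree rather than XSLT. The sample is a trimmed, hypothetical version of the entry in the question (the <trial> root and the ids d1 and o1 are invented), and collapsed_text is an illustrative helper that only approximates XPath's normalize-space().

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up sample wrapped in a hypothetical <trial> root element.
SAMPLE = """<trial><p>
<persName id="d1" type="defendantName">
Alice
Jones
<interp inst="d1" type="gender" value="female"/>
</persName>
, was indicted for <rs id="o1" type="offenceDescription">
<interp inst="o1" type="offenceCategory" value="theft"/>
privately stealing a Hat
</rs> .</p></trial>"""


def collapsed_text(element):
    """Join all descendant text nodes and collapse whitespace runs."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(SAMPLE)
p = root.find("p")

# ".//" searches every descendant of p, not just its direct children.
defendant = p.find(".//persName[@type='defendantName']")
gender = defendant.find("interp[@type='gender']").get("value")
offence = p.find(".//interp[@type='offenceCategory']").get("value")

print(collapsed_text(defendant))  # Alice Jones
print(gender, offence)            # female theft
print(collapsed_text(p))
```

The last line prints the paragraph text with the markup removed, which is the "text between the tags" the question asks about.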
XSLT (save as an .xsl file, which is itself a special kind of .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>

  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>
      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>
      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text
    data.append(inner)

trial_df = pd.DataFrame(data)
print(trial_df)
For the 1,000 similar XML files, run this process in a loop, append each one-row trial_df data frame to a list, and stack the list with pd.concat.
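The loop described above might be sketched as follows. Here file_to_frame is a hypothetical callable, not part of lxml or pandas: it stands for the parse, transform, and DataFrame steps wrapped into a function that takes a file path and returns a one-row frame. The folder name in the usage comment is also an assumption.

```python
import glob
import os


def xml_paths(folder):
    """Return the .xml files in folder, sorted for a reproducible row order."""
    return sorted(glob.glob(os.path.join(folder, "*.xml")))


def stack_trials(paths, file_to_frame):
    """Build one frame per file, then stack them with a single pd.concat."""
    import pandas as pd  # imported here so xml_paths works without pandas

    # file_to_frame is a hypothetical helper returning a one-row DataFrame.
    frames = [file_to_frame(path) for path in paths]
    # ignore_index=True renumbers rows 0..n-1 instead of repeating index 0.
    return pd.concat(frames, ignore_index=True)


# Usage, assuming the files live in a folder named "trials":
# all_trials = stack_trials(xml_paths("trials"), file_to_frame)
```

Calling pd.concat once on the full list is generally cheaper than growing a data frame inside the loop, since each concatenation copies the data.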
XML Output
<?xml version="1.0"?>
<data>
<defendantName>Alice Jones</defendantName>
<defendantGender>female</defendantGender>
<offenceCategory>theft</offenceCategory>
<offenceSubCategory>shoplifting</offenceSubCategory>
<victimName>Edward Hillior</victimName>
<victimGender>male</victimGender>
<verdictCategory>guilty</verdictCategory>
<verdictSubCategory>theftunder1s</verdictSubCategory>
<punishmentCategory>transport</punishmentCategory>
<trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>
Data Frame Output
#    defendantGender  defendantName  offenceCategory  offenceSubCategory  \
# 0           female    Alice Jones            theft         shoplifting
#    punishmentCategory                                          trialText  \
# 0           transport  Alice Jones , of St. Michael's Cornhill, was i...
#    verdictCategory  verdictSubCategory  victimGender      victimName
# 0           guilty        theftunder1s          male  Edward Hillior
Regarding "python - nested XML file to pandas data frame", the original question is on Stack Overflow: https://stackoverflow.com/questions/49439081/