python - Nested XML file to pandas DataFrame

Tags: python xml pandas dataframe

I am having trouble parsing an XML file to convert it into a pandas DataFrame. A sample entry looks like this:

<p>


 <persName id="t17200427-2-defend31" type="defendantName">
 Alice 
 Jones 
 <interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
 <interp inst="t17200427-2-defend31" type="given" value="Alice"/>
 <interp inst="t17200427-2-defend31" type="gender" value="female"/>
 </persName> 

 , of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName> 
 <interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
 <interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
 <join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
 <interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
 <interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
 privately stealing a Bermundas Hat, value 10 s. out of the Shop of 

 <persName id="t17200427-2-victim33" type="victimName">
 Edward 
 Hillior 
 <interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
 <interp inst="t17200427-2-victim33" type="given" value="Edward"/>
 <interp inst="t17200427-2-victim33" type="gender" value="male"/>
 <join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
 </persName> 



 </rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs> 
 <join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
 <interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
 <interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
 Guilty to the value of 10 d.
 </rs> 
 <rs id="t17200427-2-punish11" type="punishmentDescription">
 <interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
 <join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
 Transportation
 </rs> .</p>

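I want a DataFrame with columns for gender, offence, and trial text. I have previously extracted all of the attribute data into a DataFrame, but I cannot get the text between the <p> tags.

Here is the sample code: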

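# Imports assumed; the question does not show them
import xml.etree.ElementTree as ET
import pandas as pd

def table_of_cases(xml_file_name):
    file = ET.ElementTree(file = xml_file_name)
    iterate = file.iter()  # getiterator() was removed in Python 3.9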
    i = 1
    table = pd.DataFrame()
    for element in iterate:
        if element.tag == "persName":
            t = element.attrib['type']
            try:
                val = [element.attrib['value']]
                if t not in labels:
                    table[t] = val
                elif t+num not in labels:
                    table[t+num] = val
                elif t+num in labels:
                    num = str(i+1)
                    table[t+num] = val
            except Exception:
                pass
            labels = list(table.columns.values)
            num = str(i)

    return table

** I have over 1,000 files in this same XML format to combine into one DataFrame.

Best answer

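Because your XML is fairly complex, with text values spilling across nodes, consider XSLT, a special-purpose language designed to transform XML files, including reshaping complex documents into simpler ones.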

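Python's third-party module, lxml, can run XSLT 1.0 scripts and also XPath 1.0 expressions to parse the transformed result for migration into a pandas DataFrame. Alternatively, you can use external XSLT processors, which Python can call via subprocess.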
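A minimal sketch of the subprocess route, assuming the xsltproc command-line tool (shipped with libxslt) is installed and on PATH. The function names and file paths are illustrative and not part of the original answer.

```python
import shutil
import subprocess


def build_xsltproc_command(xsl_path, xml_path, out_path):
    """Build the argument list: xsltproc -o OUTPUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]


def run_xsltproc(xsl_path, xml_path, out_path):
    """Transform xml_path with xsl_path and write the result to out_path."""
    if shutil.which("xsltproc") is None:
        raise FileNotFoundError("xsltproc is not installed or not on PATH")
    # check=True raises CalledProcessError if the transformation fails
    subprocess.run(build_xsltproc_command(xsl_path, xml_path, out_path), check=True)
    return out_path


# Hypothetical usage:
# run_xsltproc("XSLT_Script.xsl", "Source.xml", "Output.xml")
```

The transformed file can then be parsed like any other XML document.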

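The XSLT below uses XPath's descendant::* axis to extract the necessary attributes from the defendant and the victim, along with the text value of the entire paragraph. It starts from the root and assumes <p> is one of its children.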

XSLT (save as an .xsl file, which is itself a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>

  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>

      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>

      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>       

</xsl:stylesheet>

Python

import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text

    data.append(inner)

trial_df = pd.DataFrame(data)

print(trial_df)

For the 1,000 similar XML files, run this process in a loop, append each one-row trial_df DataFrame to a list, and stack them with pd.concat.
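That loop could be sketched as follows. The helper names transform_file and build_trials_frame, and the trials/*.xml pattern in the usage comment, are hypothetical. The stylesheet is assumed to be the one shown earlier, producing a single <data> element per source file.

```python
import lxml.etree as et
import pandas as pd


def transform_file(xml_path, transformer):
    """Run the compiled XSLT on one file and return a one-row DataFrame."""
    result = transformer(et.parse(xml_path))
    rows = [{child.tag: child.text for child in node.xpath('*')}
            for node in result.xpath('/*')]
    return pd.DataFrame(rows)


def build_trials_frame(xml_paths, xsl_path):
    """Transform every file and stack the per-file frames into one."""
    # Compile the stylesheet once and reuse it for every file
    transformer = et.XSLT(et.parse(xsl_path))
    frames = [transform_file(path, transformer) for path in xml_paths]
    # pd.concat raises ValueError if xml_paths is empty
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage:
# import glob
# trials_df = build_trials_frame(sorted(glob.glob("trials/*.xml")), "XSLT_Script.xsl")
```

Concatenating once at the end avoids repeatedly copying a growing DataFrame inside the loop.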

XML Output

<?xml version="1.0"?>
<data>
  <defendantName>Alice Jones</defendantName>
  <defendantGender>female</defendantGender>
  <offenceCategory>theft</offenceCategory>
  <offenceSubCategory>shoplifting</offenceSubCategory>
  <victimName>Edward Hillior</victimName>
  <victimGender>male</victimGender>
  <verdictCategory>guilty</verdictCategory>
  <verdictSubCategory>theftunder1s</verdictSubCategory>
  <punishmentCategory>transport</punishmentCategory>
  <trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>

DataFrame Output

#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting   

#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...   

#   verdictCategory verdictSubCategory victimGender      victimName  
# 0          guilty       theftunder1s         male  Edward Hillior  
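As a rough cross-check of what the stylesheet's normalize-space() and descendant:: expressions compute, the same values can be approximated with the standard library alone. This is only a sketch: the SAMPLE string is an abridged version of the question's entry with shortened ids, the <trial> root element is an assumption, and the helper names are illustrative. Python's str.split() also treats a few more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical sample modelled on the entry in the question
SAMPLE = """<trial><p>
<persName id="d1" type="defendantName"> Alice
 Jones <interp inst="d1" type="surname" value="Jones"/>
 <interp inst="d1" type="gender" value="female"/></persName>
 , was indicted for <rs id="o1" type="offenceDescription">
 <interp inst="o1" type="offenceCategory" value="theft"/>
 privately stealing a Hat</rs> .</p></trial>"""


def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


def string_value(element):
    """Approximate an element's XPath string value: all descendant text, in order."""
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)
p = root.find("p")
defendant = p.find(".//persName[@type='defendantName']")

row = {
    "defendantName": normalize_space(string_value(defendant)),
    "defendantGender": defendant.find("interp[@type='gender']").get("value"),
    "offenceCategory": p.find(".//interp[@type='offenceCategory']").get("value"),
    "trialText": normalize_space(string_value(p)),
}
print(row)
```

The trialText value is the concatenation of every text node under <p>, which is the text between the tags that the question was missing.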

For python - nested XML file to pandas DataFrame, the original question is on Stack Overflow: https://stackoverflow.com/questions/49439081/
