python - Convert an XML file to a DataFrame more efficiently

I am trying to load a large (53 MB) XML file into a pandas DataFrame. Here are 3 rows of real data (from the NTSB's public database of aviation accident reports); the actual file has 77,257 rows:

<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
    <ROW EventId="20150901X74304" InvestigationType="Accident" AccidentNumber="GAA15CA244" EventDate="09/01/2015" Location="Truckee, CA" Country="United States" Latitude="" Longitude="" AirportCode="" AirportName="" InjurySeverity="" AircraftDamage="" AircraftCategory="" RegistrationNumber="N786AB" Make="JOE SALOMONE" Model="SUPER CUB SQ2" AmateurBuilt="" NumberOfEngines="" EngineType="" FARDescription="" Schedule="" PurposeOfFlight="" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="" ReportStatus="Preliminary" PublicationDate=""/>
    <ROW EventId="20150901X92332" InvestigationType="Accident" AccidentNumber="CEN15LA392" EventDate="08/31/2015" Location="Houston, TX" Country="United States" Latitude="29.809444" Longitude="-95.668889" AirportCode="IWS" AirportName="WEST HOUSTON" InjurySeverity="Non-Fatal" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N452CS" Make="CESSNA" Model="T240" AmateurBuilt="No" NumberOfEngines="" EngineType="" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Instructional" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="2" WeatherCondition="VMC" BroadPhaseOfFlight="LANDING" ReportStatus="Preliminary" PublicationDate="09/04/2015"/>
    <ROW EventId="20150729X33718" InvestigationType="Accident" AccidentNumber="CEN15FA325" EventDate="" Location="Truth or Consequences, NM" Country="United States" Latitude="33.250556" Longitude="-107.293611" AirportCode="TCS" AirportName="TRUTH OR CONSEQUENCES MUNI" InjurySeverity="Fatal(2)" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N32401" Make="PIPER" Model="PA-28-151" AmateurBuilt="No" NumberOfEngines="1" EngineType="Reciprocating" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Personal" AirCarrier="" TotalFatalInjuries="2" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="UNKNOWN" ReportStatus="Preliminary" PublicationDate="08/10/2015"/>
</ROWS>
</DATA>

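The following code, which I adapted from here, works, but it is very slow on this data (over 30 minutes on my system). I can't seem to apply the solutions posted for the original example because my XML is structured differently. Is there a more efficient way to load this data?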

path_to_xml_file = mypath

import pandas as pd
import xml.etree.ElementTree as ET

#Load xml file data
tree = ET.parse(path_to_xml_file)
root = tree.getroot()

#Grab list of column names
aviationdata_column_names = root[0][0].attrib.keys()             
#Create empty dataframe   
aviationdata_df = pd.DataFrame(columns=aviationdata_column_names)

#Loop through tree and append to dataframe
for i in range(len(root[0])):
    new_row = pd.Series(root[0][i].attrib)
    new_row.name = i
    aviationdata_df = aviationdata_df.append(new_row)

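Various solutions to similar problems have been posted online (here, here, and here), but I have had no luck implementing them. Version issues may be part of the reason (I am using Python 2.7).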

Best answer

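Because your XML is attribute-centric (the elements have no text values), consider iterating over all of the attributes, which xml.etree.ElementTree exposes as dictionary key/value pairs.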
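As a minimal sketch of what "attributes as key/value pairs" means here, the snippet below parses a single ROW and turns its attributes into a plain dict. The `sample` string is a hypothetical, trimmed-down version of the data in the question (only three attributes are kept), not the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-row sample modeled on the question's data (attributes trimmed).
sample = '''<DATA xmlns="http://www.ntsb.gov">
<ROWS>
    <ROW EventId="20150901X74304" InvestigationType="Accident" Latitude=""/>
</ROWS>
</DATA>'''

root = ET.fromstring(sample)
row = root[0][0]  # DATA -> ROWS -> first ROW

# The default namespace applies to element tags, not to attribute names,
# so the dictionary keys are plain strings such as "EventId".
row_dict = dict(row.items())
print(row_dict)
```

Each ROW therefore maps directly onto one record, and empty attributes come through as empty strings.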

The code below builds a list of attribute dictionaries and passes it to a single DataFrame() call:

import pandas as pd
import xml.etree.ElementTree as ET

path_to_xml_file = mypath

# Load xml file data
tree = ET.parse(path_to_xml_file)

data = []
for el in tree.iterfind('./*'):
    for i in el.iterfind('*'):
        data.append(dict(i.items()))

df = pd.DataFrame(data)
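If holding the whole parsed tree in memory is a concern, a streaming variant is possible. The sketch below is an alternative to the code above, not part of the original answer: it uses `ET.iterparse` to yield one attribute dictionary per ROW and clears each element after it has been copied. The function name `iter_row_attributes` and the trimmed `sample` bytes are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

def iter_row_attributes(source, row_tag="ROW"):
    """Yield one dict of attributes per row element, streaming the input.

    `source` may be a file path or a binary file object.
    """
    for _event, el in ET.iterparse(source, events=("end",)):
        # Tags carry the namespace as a '{uri}' prefix; compare the local name only.
        if el.tag.rsplit("}", 1)[-1] == row_tag:
            yield dict(el.attrib)
            el.clear()  # drop the element's attributes once they are copied

# Hypothetical three-row sample modeled on the question's data (attributes trimmed).
sample = b'''<DATA xmlns="http://www.ntsb.gov">
<ROWS>
    <ROW EventId="20150901X74304" Location="Truckee, CA" Latitude=""/>
    <ROW EventId="20150901X92332" Location="Houston, TX" Latitude="29.809444"/>
    <ROW EventId="20150729X33718" Location="Truth or Consequences, NM" Latitude="33.250556"/>
</ROWS>
</DATA>'''

rows = list(iter_row_attributes(io.BytesIO(sample)))
print(len(rows))
```

The resulting list can be handed to pandas in one step, for example pd.DataFrame(rows), so the DataFrame is still built once rather than row by row.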

Output

# FIRST FEW COLUMNS (selected by position, since the column labels are strings)
print(df.iloc[:, :12])

#   AccidentNumber AirCarrier AircraftCategory AircraftDamage AirportCode                 AirportName AmateurBuilt BroadPhaseOfFlight        Country     EngineType   EventDate         EventId
# 0     GAA15CA244                                                                                                                     United States                 09/01/2015  20150901X74304
# 1     CEN15LA392                    Airplane    Substantial         IWS                WEST HOUSTON           No            LANDING  United States                 08/31/2015  20150901X92332
# 2     CEN15FA325                    Airplane    Substantial         TCS  TRUTH OR CONSEQUENCES MUNI           No            UNKNOWN  United States  Reciprocating              20150729X33718

For more on python - Convert an XML file to a DataFrame more efficiently, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/41795198/