python - 在 Python 中解析大型 XML 文件

标签 python xml python-3.x celementtree

需要一些帮助。我们有一个来自 API 的大型 XML 文件,我需要将其转换为 SQL Server 表。使用 pyodbc 将表加载到 SQL 中没有问题,但我正在努力解析 16mb 的 XML 文件。该文件格式正确 - 它包含在单个根目录中。几个星期以来,我一直在寻找解决方案,但一直无法通过反复试验来解决问题。任何帮助表示赞赏。

简而言之,目标是以表格格式输出 XML 文件中包含的所有值,其中每条记录都列在自己的行中。 XML 的地址和空间部分需要对子元素进行适当的处​​理。我有可用的代码,但我认为这不是解析值的“正确”方法:

import xml.etree.ElementTree as ET
tree = ET.parse('Z:\\FileLoc\\DealData.xml')
root = tree.getroot()
    for prop in deals[0]:
    for addr1 in prop[4].iter(tag="Street"):
        for addr2 in prop[4].iter(tag="City"):
            for addr3 in prop[4].iter(tag="State"):
                for addr4 in prop[4].iter(tag="PostalCode"):
                    for addr5 in prop[4].iter(tag="Country"):
                        for spaces in prop[5]:
                                print(deals.attrib.get("DealID"), prop.attrib.get("ID"),
                                   prop[0].text[0:5], prop[0].text[6:],
                                   addr1.text, addr2.text, addr3.text, addr4.text, addr5.text,
                                   spaces.attrib.get("ID"))

这是我正在使用的 XML 的一个示例:

    <?xml version="1.0" encoding="UTF-8"?>
<DealExport>
    <jobid>8558e207709c0730</jobid>
    <Deals>
        <Deal DealID="706880">
            <Properties>
                <Property ID="69320">
                    <Name>41381_BuildingName - Bldg 1</Name>
                    <CreatedAt>2015-10-28T04:09:07Z</CreatedAt>
                    <UpdatedAt>2017-12-20T01:09:51Z</UpdatedAt>
                    <Symbols>
                        <Type>SourceID</Type>
                        <Value>41381~41381</Value>
                    </Symbols>
                    <Address>
                        <City>Calgary</City>
                        <State>AB</State>
                        <Street>1234 Five Street</Street>
                        <PostalCode>H0H 0H0</PostalCode>
                        <Country>Canada</Country>
                    </Address>
                    <Spaces>
                        <Space ID="848294">
                            <Suite>800</Suite>
                            <SpaceAvailable>11225</SpaceAvailable>
                            <FloorName>8</FloorName>
                            <FloorPosition>7</FloorPosition>
                            <FloorRentableArea/>
                            <CreatedAt>2016-04-20T13:47:13Z</CreatedAt>
                            <UpdatedAt>2017-10-02T09:17:20Z</UpdatedAt>
                            <Symbols>
                                <Type>SourceID</Type>
                                <Value>800</Value>
                            </Symbols>
                        </Space>
                    </Spaces>
                </Property>
            </Properties>
            <DealType>new</DealType>
            <LastModified>2017-12-20T22:36:21Z</LastModified>
            <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
            <Stage>dead_deal</Stage>
            <Probability>0</Probability>
            <CompetitiveSet/>
            <MoveInDate>2017-12-01</MoveInDate>
            <Tenant>
                <Name>Current Occupant Ltd.</Name>
                <Industry>financial services</Industry>
                <Affiliation/>
                <TenantID>1013325</TenantID>
                <TenantWebsite/>
                <TenantCurrentLocation/>
                <TenantRSFFrom>6000</TenantRSFFrom>
                <TenantRSFTo>7000</TenantRSFTo>
            </Tenant>
            <Brokers></Brokers>
            <TenantContacts></TenantContacts>
            <DealStages>
                <DealStage ID="1533265">
                    <Stage>proposal</Stage>
                    <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
                    <LastModified>2017-06-06T15:09:22Z</LastModified>
                    <StartDate>2017-06-06T15:09:22Z</StartDate>
                    <DurationInDays>197</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>17699</UserId>
                        <FirstName>Person</FirstName>
                        <LastName>Last-Name</LastName>
                        <Email>employee@email.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174388">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:01Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:01Z</LastModified>
                    <StartDate>2017-12-20T22:36:01Z</StartDate>
                    <DurationInDays>0</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174390">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:21Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:21Z</LastModified>
                    <StartDate>2017-12-20T22:36:21Z</StartDate>
                    <DurationInDays>99</DurationInDays>
                    <DeadDealReasons>
                        <DeadDealReason>price</DeadDealReason>
                    </DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
            </DealStages>
            <DealComments></DealComments>
            <DealTerms>
                <DealTerm ID="4278580">
                    <ProposalType>landlord</ProposalType>
                    <DiscountRate>0.08</DiscountRate>
                    <RentableArea>6200</RentableArea>
                    <LeaseType>triple net</LeaseType>
                    <LeaseTerm>60</LeaseTerm>
                    <CommencementDate>2018-01-01</CommencementDate>
                    <TenantImprovements></TenantImprovements>
                    <BuildingImprovements></BuildingImprovements>
                    <FreeRents></FreeRents>
                    <CreatedDate>2017-06-06T15:16:23Z</CreatedDate>
                    <SecurityDeposit/>
                    <NER>20714.1144670763</NER>
                    <NEROverride/>
                    <DateEntered>2017-06-06</DateEntered>
                    <OpEx>
                        <BaseOpEx/>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </OpEx>
                    <RealEstateTaxes>
                        <BaseRealEstateTaxes>7.45</BaseRealEstateTaxes>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </RealEstateTaxes>
                    <NPVperSqFt>103367.15936951</NPVperSqFt>
                    <TotalNPV>640876388.09096</TotalNPV>
                    <MiscDescription/>
                    <NetCashFlow>642137720.0</NetCashFlow>
                    <TotalIncome>502200.0</TotalIncome>
                    <Concessions>64480.0</Concessions>
                    <Payback>1</Payback>
                    <IRR/>
                    <Latest>true</Latest>
                    <BaseRents>
                        <BaseRent ID="45937180">
                            <BeginIn>1</BeginIn>
                            <Rent>15.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>36</Duration>
                        </BaseRent>
                        <BaseRent ID="45937181">
                            <BeginIn>37</BeginIn>
                            <Rent>18.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>24</Duration>
                        </BaseRent>
                    </BaseRents>
                    <RentEscalations></RentEscalations>
                    <OtherCredits></OtherCredits>
                    <Commissions>
                        <Commission ID="260593">
                            <RepType>landlord</RepType>
                            <Firm>LandLord</Firm>
                            <Amount>2.9</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                        <Commission ID="260594">
                            <RepType>landlord</RepType>
                            <Firm>Payee</Firm>
                            <Amount>7.5</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                    </Commissions>
                    <Rights></Rights>
                    <ProposalID>186607</ProposalID>
                    <ProposalEnteredDate>2017-06-06</ProposalEnteredDate>
                    <RecoveryIncomes>
                        <RecoveryIncome>
                            <RecoveryType>ReimbursableExpenseCam</RecoveryType>
                            <ExpenseAmount>12.87</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>12.87</RecoveryAmount>
                        </RecoveryIncome>
                        <RecoveryIncome>
                            <RecoveryType>RealEstateTax</RecoveryType>
                            <ExpenseAmount>7.45</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>7.45</RecoveryAmount>
                        </RecoveryIncome>
                    </RecoveryIncomes>
                </DealTerm>
            </DealTerms>
            <Budgets></Budgets>
            <Appraisals></Appraisals>
            <Tours></Tours>
        </Deal>

最佳答案

据我所知,使用 XML 格式的大文件没有问题。您可以将 xml.etree.ElementTree 用于您的数据。我测试了以下代码并且它有效(解析和打印):

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

def deepParsing(node, pre=None):
    if pre is None:
        pre = '-'
    print(' ')
    print(pre + '> Tag:',node.tag)
    print(pre + '> Attributes:', node.attrib)
    print(pre + '> Text:', node.text)

    pre += '-'
    for key in node:
        deepParsing(key, pre)
deepParsing(root)

您的数据已根据给定的树打印在控制台上。

您的示例缺少结束标记:

    </Deals>
</DealExport>

在 xml 的末尾。

关于python - 在 Python 中解析大型 XML 文件,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/49597947/

相关文章:

python-3.x - Python 3 中的字符串替换?

python - 如何显示字典中的项目列表?

python - 将通过 PIL 创建的图像保存到 django 模型

php - 我可以做什么来更好地构建我的文件目录?

java - 用于基于表单的身份验证的 Tomcat 7 领域配置

python - 从 CSV 文件中删除双引号

python - 使用 pandas 数据框绘制热图

python - 尝试提供全局日志记录功能

php - simplexmlelement 上的 json_decode/encode 添加数组而不是空字符串

python - 在 Python 中使用 lambda 时遇到问题