python - How to parse possibly malformed XML into a DataFrame?

Tags: python xml python-3.x pandas

I have XML from an API that looks like this.

import requests
import pandas as pd
import lxml.etree as et
from lxml import etree


url = 'abc.com'

xml_data1 = requests.get(url).content
print(xml_data1)

xml_data1:

    <?xml version="1.0" encoding="utf-8"?>
    <Leads>
      <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test">
    <Campaign CampaignId="123" CampaignTitle="abc" />
    <Status StatusId="123" StatusTitle="test" />
    <Agent AgentId="123" AgentName="test, test" AgentEmail="a@a.com">
      <AgentCustomFields custom1="test test, test" custom2="test" custom3="" custom4="" />
    </Agent>
    <Fields>
      <Field FieldId="7" Value="a@a.com" FieldTitle="test" FieldType="test" />
      <Field FieldId="8" Value="test" FieldTitle="test 1" FieldType="test" />
      <Field FieldId="9" Value="test" FieldTitle="City" FieldType="Text" />
      <Field FieldId="10" Value="test" FieldTitle="State" FieldType="State" />
      <Field FieldId="11" Value="test" FieldTitle="test" FieldType="Zip" />
      <Field FieldId="950" Value="test." FieldTitle="Business Name" FieldType="Text" />
      <Field FieldId="1261" Value="Intuit Desktop" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1262" Value="test" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1263" Value="test" FieldTitle="test" FieldType="Number" />
      <Field FieldId="1267" Value="test" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1310" Value="test" FieldTitle="test" FieldType="Phone" />
      <Field FieldId="1319" Value="test" FieldTitle="test" FieldType="Number" />
      <Field FieldId="1485" Value="test" FieldTitle="tst" FieldType="State" />
    </Fields>
    <Logs>
      <StatusLog>
        <Status LogId="123" LogDate="01/04/2017 03:08:44" StatusId="28" StatusTitle="test" AgentId="19" AgentName="test" AgentEmail="test@test.com" />
      </StatusLog>
      <ActionLog>
        <Action LogId="123" ActionTypeId="73" ActionTypeName="test" MilestoneId="1" ActionDate="01/04/2017 03:08:44" ActionNote="test" AgentId="19" AgentName="test,test" AgentEmail="test@test.com" />
      </ActionLog>
      <EmailLog>
        <Email LogId="123" SendDate="01/01/2017 20:53:39" EmailTemplateId="1" EmailTemplateName="test " AgentId="1" AgentName="test" AgentEmail="test@test.com" />
      </EmailLog>
      <DistributionLog>
        <Distribution LogId="1" LogDate="01/01/2017 10:10:08" DistributionProgramId="1" DistributionProgramName="test" AssignedAgentId="1" AssignedAgentName="test,test" AssignedAgentEmail="test@test.com" />
      </DistributionLog>
      <CreationLog LogId="1" LogDate="01/01/2017 10:10:05" Imported="true" CreatedByAgentId="1" CreatedByAgentName="test, test" CreatedByAgentEmail="test@test.com" />
    </Logs>
  </Lead>
</Leads>

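Due to work concerns I cannot post the entire XML string, but it follows the structure above. According to an XML validator the XML is valid, but when I make another API call that returns a different XML string, it looks like this: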

b'<?xml version="1.0" encoding="utf-8"?>\r\n<Leads>\r\n  <Lead Id="123" />\r\n  <Lead Id="456" />\r\n</Leads>'

I can successfully load the XML above into a DataFrame with the following code:

class XML2DataFrame:

    def __init__(self, xml_data):
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        """Return a list of dictionaries, one per child of this XML root,
        built from the attributes of each child and its descendants."""
        return [self.parse_element(child) for child in iter(root)]

    def parse_element(self, element, parsed=None):
        """Collect {key: attribute} from this XML element and all its
        children into a single dictionary of strings."""
        if parsed is None:
            parsed = dict()

        for key in element.keys():
            if key not in parsed:
                parsed[key] = element.attrib.get(key)
            else:
                raise ValueError('duplicate attribute {0} at element {1}'.format(key, element.getroottree().getpath(element)))

        # Recurse into the children, accumulating into the same dictionary
        for child in list(element):
            self.parse_element(child, parsed)

        return parsed

    def process_data(self):
        """ Initiate the root XML, parse it, and return a dataframe"""
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)

xml2df = XML2DataFrame(xml_data)
xml_dataframe = xml2df.process_data()

However, when I pass the possibly malformed XML string to the class above, I get this error:

AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getroottree'

Since the possibly malformed XML has multiple values in the same tag, I assume the function cannot parse it.
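One possible reading of this traceback, assuming the class was run with the standard library's `xml.etree.ElementTree` imported as `ET` (that import is not shown above): the XML itself parses, but attribute names such as `LogId` appear on more than one nested element, so the duplicate check fires, and building its error message calls `getroottree()`, which lxml elements provide and standard-library elements do not. A minimal sketch with a made-up sample:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same attribute name (LogId) appears on two nested elements.
sample = '<Leads><Lead Id="1"><Status LogId="1"/><Action LogId="2"/></Lead></Leads>'
root = ET.fromstring(sample)

# The document is well-formed: parsing succeeds and the Lead element is found.
lead = root[0]
print(lead.attrib)

# Standard-library elements have no getroottree(); lxml elements do.
has_getroottree = hasattr(lead, "getroottree")
print(has_getroottree)

# Collect attribute names that occur more than once at or below the Lead element.
seen, duplicates = set(), set()
for element in lead.iter():
    for key in element.attrib:
        if key in seen:
            duplicates.add(key)
        seen.add(key)
print(sorted(duplicates))
```

If that reading is right, the input is not malformed; the flattening step simply has no way to store two attributes with the same name in one dictionary.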

I would like to push the possibly malformed XML into a flat DataFrame.

Edit: output column headers from the XML:

 ActionCount           CreateDate Flagged      Id LastDistributionDate  LeadFormType                                   LeadTitle LogCount FieldId                 FieldTitle FieldType                          Value CampaignId  CampaignTitle  AgentEmail AgentId     AgentName              LogDate   LogId  StatusId       StatusTitle AssignedAgentEmail AssignedAgentId AssignedAgentName DistributionProgramId DistributionProgramName              LogDate   LogId  

Best answer

Since you updated the question, I decided to post another answer using the new XML.

from bs4 import BeautifulSoup 
import pandas as pd

xml = """
    <?xml version="1.0" encoding="utf-8"?>
    <Leads>
      <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test">
    <Campaign CampaignId="123" CampaignTitle="abc" />
    <Status StatusId="123" StatusTitle="test" />
    <Agent AgentId="123" AgentName="test, test" AgentEmail="a@a.com">
      <AgentCustomFields custom1="test test, test" custom2="test" custom3="" custom4="" />
    </Agent>
    <Fields>
      <Field FieldId="7" Value="a@a.com" FieldTitle="test" FieldType="test" />
      <Field FieldId="8" Value="test" FieldTitle="test 1" FieldType="test" />
      <Field FieldId="9" Value="test" FieldTitle="City" FieldType="Text" />
      <Field FieldId="10" Value="test" FieldTitle="State" FieldType="State" />
      <Field FieldId="11" Value="test" FieldTitle="test" FieldType="Zip" />
      <Field FieldId="950" Value="test." FieldTitle="Business Name" FieldType="Text" />
      <Field FieldId="1261" Value="Intuit Desktop" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1262" Value="test" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1263" Value="test" FieldTitle="test" FieldType="Number" />
      <Field FieldId="1267" Value="test" FieldTitle="test" FieldType="Text" />
      <Field FieldId="1310" Value="test" FieldTitle="test" FieldType="Phone" />
      <Field FieldId="1319" Value="test" FieldTitle="test" FieldType="Number" />
      <Field FieldId="1485" Value="test" FieldTitle="tst" FieldType="State" />
    </Fields>
    <Logs>
      <StatusLog>
        <Status LogId="123" LogDate="01/04/2017 03:08:44" StatusId="28" StatusTitle="test" AgentId="19" AgentName="test" AgentEmail="test@test.com" />
      </StatusLog>
      <ActionLog>
        <Action LogId="123" ActionTypeId="73" ActionTypeName="test" MilestoneId="1" ActionDate="01/04/2017 03:08:44" ActionNote="test" AgentId="19" AgentName="test,test" AgentEmail="test@test.com" />
      </ActionLog>
      <EmailLog>
        <Email LogId="123" SendDate="01/01/2017 20:53:39" EmailTemplateId="1" EmailTemplateName="test " AgentId="1" AgentName="test" AgentEmail="test@test.com" />
      </EmailLog>
      <DistributionLog>
        <Distribution LogId="1" LogDate="01/01/2017 10:10:08" DistributionProgramId="1" DistributionProgramName="test" AssignedAgentId="1" AssignedAgentName="test,test" AssignedAgentEmail="test@test.com" />
      </DistributionLog>
      <CreationLog LogId="1" LogDate="01/01/2017 10:10:05" Imported="true" CreatedByAgentId="1" CreatedByAgentName="test, test" CreatedByAgentEmail="test@test.com" />
    </Logs>
  </Lead>
</Leads>
"""

soup = BeautifulSoup(xml, "xml")
# Get Attributes from all nodes
attrs = []
for elm in soup():  # soup() is equivalent to soup.find_all()
    attrs.append(elm.attrs)

# Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
fields_attribute_list = [x for x in attrs if 'FieldId' in x]
other_attribute_list = [x for x in attrs if 'FieldId' not in x and x != {}]

# Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
attribute_dict = {}
for d in other_attribute_list:
    for k, v in d.items():  
        attribute_dict.setdefault(k, v)

# Update each field row with attributes from all other nodes.
full_list = []
for field in fields_attribute_list:
    field.update(attribute_dict)
    full_list.append(field)

# Make Dataframe
df = pd.DataFrame(full_list)

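Note, however, that this approach keeps only one value for attributes that share a name across elements, such as LogId: because of setdefault, the first value encountered is kept and later ones are dropped. Either way, this code should help you get started.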
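One way to keep same-named attributes apart, sketched here with the standard library parser rather than BeautifulSoup, is to prefix each key with the tag of the element it came from, so `Status.LogId` and `Action.LogId` stay separate. The `Tag.Attribute` naming scheme and the `flatten_lead` helper are assumptions for illustration, and the sample is a shortened, made-up version of the XML above. Attributes that repeat under the same tag name would still collide, in which case the first value is kept.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample that follows the structure of the XML above.
sample = """<Leads>
  <Lead Id="123">
    <Campaign CampaignId="123" CampaignTitle="abc" />
    <Fields>
      <Field FieldId="9" Value="test" FieldTitle="City" />
      <Field FieldId="10" Value="test" FieldTitle="State" />
    </Fields>
    <Logs>
      <StatusLog><Status LogId="1" StatusId="28" /></StatusLog>
      <ActionLog><Action LogId="2" ActionTypeId="73" /></ActionLog>
    </Logs>
  </Lead>
</Leads>"""


def flatten_lead(lead):
    """Return one dict per Field, each carrying the lead's other attributes."""
    shared = {}
    for element in lead.iter():
        if element.tag == "Field":
            continue
        for key, value in element.attrib.items():
            # Prefix with the tag so Status.LogId and Action.LogId do not collide.
            shared.setdefault(f"{element.tag}.{key}", value)
    rows = []
    for field in lead.iter("Field"):
        row = {f"Field.{key}": value for key, value in field.attrib.items()}
        row.update(shared)
        rows.append(row)
    return rows


root = ET.fromstring(sample)
rows = [row for lead in root.iter("Lead") for row in flatten_lead(lead)]
print(sorted(rows[0]))
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per Field, with the lead-level and log attributes repeated on each row.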

Regarding "python - How to parse possibly malformed XML into a DataFrame?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51843757/
