python - Reading a single node from an XML URL using Pandas

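I am trying to read an XML file and access one specific attribute, in this case the DonorAdvisedFundInd attribute, and use it to create a data frame in Pandas. So far I have tried the following code: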

import xml.etree.ElementTree as et
import requests
 
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
 
xtree = et.parse(xml_data)
xroot = xtree.getroot()
 
df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.attrib.get("DonorAdvisedFundInd")
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df

But I get this error message: Errno 36: file name too long

I would appreciate any feedback or alternative suggestions anyone can offer. Thanks!

Best answer

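Consider read_xml, a method new in Pandas 1.3+. In fact, its IO tools docs include an example that retrieves IRS-990 XML forms from the AWS S3 bucket, which requires the s3fs package. Otherwise, pass the URL directly, without requests.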

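Importantly, because IRS 990 forms declare a default namespace, pass the namespaces argument and use its prefix in the XPath query. Note: the xpath in the S3 and HTTPS snippets must be adjusted to the parent of the DisplayName node, so that DisplayName and its siblings become the columns of the data frame.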
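To make the namespace point concrete, here is a minimal sketch that runs read_xml on a small inline document instead of the real file. The sample XML is invented and heavily simplified; in particular, IRS990 as the parent of DonorAdvisedFundInd and the OtherSampleInd sibling are assumptions for illustration, not verified against the actual return. parser="etree" is used only so that the sketch does not depend on lxml.

```python
import io

import pandas as pd

# Invented sample that mimics a document with a default namespace.
# The element layout is an assumption, not the real IRS 990 structure.
sample_xml = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990>
      <DonorAdvisedFundInd>false</DonorAdvisedFundInd>
      <OtherSampleInd>true</OtherSampleInd>
    </IRS990>
  </ReturnData>
</Return>"""

# The prefix "irs" is arbitrary; it only has to map to the document's
# default namespace URI and be used consistently in the xpath.
df = pd.read_xml(
    io.StringIO(sample_xml),
    xpath=".//irs:IRS990",
    namespaces={"irs": "http://www.irs.gov/efile"},
    parser="etree",
)

# The children of the matched parent become the columns.
print(df.columns.tolist())
```

Without the namespaces argument and the matching prefix, the xpath would match nothing, because every element in such a document belongs to the default namespace.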

S3 path

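import pandas as pd

# Parent_of_DisplayName is a placeholder: replace it with the real parent node.
df = pd.read_xml(
    "s3://irs-form-990/201903199349320465_public.xml",
    xpath=".//irs:Parent_of_DisplayName",
    namespaces={"irs": "http://www.irs.gov/efile"}
)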

HTTPS path

df = pd.read_xml(
    "https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml",
    xpath=".//irs:Parent_of_DisplayName",
    # The prefix used in xpath must be the one defined in namespaces.
    namespaces={"irs": "http://www.irs.gov/efile"}
)
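As for the original error: et.parse expects a file name or a file object, so the downloaded bytes were treated as a (very long) file name. If you prefer to stay with ElementTree, a minimal sketch of one alternative is to parse the in-memory content with fromstring and search with the namespace. The inline bytes below are an invented stand-in for the .content returned by requests, and the sketch assumes DonorAdvisedFundInd is an element whose value is its text rather than an XML attribute.

```python
import xml.etree.ElementTree as ET

# Invented, simplified stand-in for requests.get(url).content
xml_data = b"""<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990>
      <DonorAdvisedFundInd>false</DonorAdvisedFundInd>
    </IRS990>
  </ReturnData>
</Return>"""

# fromstring() parses in-memory data; parse() wants a file name or file object.
xroot = ET.fromstring(xml_data)

# Map a prefix of our choosing to the document's default namespace.
ns = {"irs": "http://www.irs.gov/efile"}

# Search the whole tree, not just the root's direct children.
df_rows = [
    {"DAF": node.text}
    for node in xroot.iterfind(".//irs:DonorAdvisedFundInd", ns)
]
print(df_rows)
```

The resulting list of dictionaries can then be passed to pd.DataFrame(df_rows, columns=["DAF"]) as in the question.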

Regarding python - Reading a single node from an XML URL using Pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62734890/
