xml - R XML: pulling nodes into a matrix/data frame while accounting for missing nodes

Tags: xml r xpath xml-parsing

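I'm fairly new to R, and very new to the XML package and XPath. I need to extract four elements from an XML file that looks like the one below (except that I have trimmed many other XML nodes to keep it simple here):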

<?xml version="1.0" encoding="utf-8"?>
<iati-activities version="1.03" generated-datetime="2015-07-07T16:49:09+00:00">
  <iati-activity last-updated-datetime="2014-08-11T14:36:59+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100530</iati-identifier>
<title>Improvement of basic health care</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<participating-org role="Funding" ref="EU" type="15">EU</participating-org>
<participating-org role="Funding" type="21">Cordaid Memisa</participating-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="21">CORDAID RCA</participating-org>
<recipient-country percentage="100" code="CF">CENTRAL AFRICAN REPUBLIC</recipient-country>
<budget type="1">
  <period-start iso-date="2010-01-01"></period-start>
  <period-end iso-date="2013-02-28"></period-end>
</budget>
  </iati-activity>
  <iati-activity last-updated-datetime="2013-07-19T14:12:14+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100625</iati-identifier>
<title>Pigs for Pencils</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Funding" type="60">Stichting Kapatiran</participating-org>
<participating-org role="Implementing" type="22">PREDA Foundation Inc.</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
  <period-start iso-date="2010-04-20"></period-start>
  <period-end iso-date="2012-10-02"></period-end>
  <value value-date="2010-04-20">12500</value>
</budget>
   </iati-activity>
  <iati-activity last-updated-datetime="2015-04-08T03:01:58+00:00" xml:lang="en" default-currency="EUR">
    <iati-identifier>NL-KVK-41160054-100815</iati-identifier>
<title>Job and housing opportunities for women </title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="22">WISE</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
  <period-start iso-date="2010-10-01"></period-start>
  <period-end iso-date="2011-12-31"></period-end>
  <value value-date="2010-10-01">227000</value>
</budget>
  </iati-activity>
</iati-activities>

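This is also my first question on StackOverflow, so I apologize if I haven't asked it correctly (and for the XML not being perfectly aligned). The elements I need, and the names I am assigning them to, are: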

UniqueID <- "//iati-activity/iati-identifier"

GrantTitle <- "//iati-activity/title"

GrantAmount <- "//iati-activity/budget/value"

Recipient <- "//iati-activity/participating-org[@role='Implementing']"

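So far (after much trial and tribulation) I have come up with the code shown further down. It defines a function that walks the current node (x), pulls the 4 variables, and binds them into a row. xpathApply then loops over the iati-activity nodes, calls the function on each one, and the resulting rows are bound together.

This code works when all four elements are present in every activity. However, note that the budget/value node is missing from the first activity in the XML sample. I removed it deliberately to reproduce the missing-node problem, which occurs frequently in the full file for almost every element I need.

Also note the [1] at the end of the XPath expressions. I included it because there are also multiple titles, multiple participating orgs of every type, and so on.

Given the multiples of some elements and the absence of others, it isn't possible to simply pull all of the same elements into vectors and drop them into a data frame. Hence the need to loop over each activity to pull the elements. My code currently can't account for missing elements (the budget/value missing from the first iati-activity), because cbind (and rbind) ignore empty vectors.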
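The failure mode described above can be reproduced without any XML at all. The following is a minimal base-R sketch; the values and the names `full_row` and `short_row` are made up for illustration, and `character(0)` stands in for an XPath query that matched nothing.

```r
# A row where every field was found: 1 x 3 character matrix
full_row <- cbind(UniqueID = "A-2", GrantAmount = "12500", Recipient = "Org B")

# A row where GrantAmount matched nothing: cbind silently drops the
# zero-length argument, so the column disappears instead of becoming NA
short_row <- cbind(UniqueID = "A-1", GrantAmount = character(0), Recipient = "Org A")

ncol(full_row)       # 3 columns
colnames(short_row)  # only "UniqueID" and "Recipient" survive

# Binding rows of different widths then fails; capture the message
# instead of stopping the script
result <- tryCatch(rbind(short_row, full_row),
                   error = function(e) conditionMessage(e))
result
```

This is why each field has to be turned into an explicit NA before the row is assembled.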

xmltestNA = xmlInternalTreeParse("XMLtoDF_TestNA.xml", useInternalNodes=TRUE)
bodyToDF <- function(x){
  UniqueID <- xpathSApply(x, "./iati-identifier", xmlValue)
  GrantTitle <- xpathSApply(x, "./title[1]", xmlValue)
  GrantAmount <- xpathSApply(x, "./budget/value[1]", xmlValue)
  Recipient <- xpathSApply(x, "./participating-org[@role='Implementing'][1]", xmlValue)
  cbind(UniqueID=UniqueID, GrantTitle=GrantTitle, GrantAmount=GrantAmount, Recipient=Recipient)
  }
res <-xpathApply(xmltestNA, '//iati-activity', fun=bodyToDF)
IatiNA <-do.call(rbind, res)
IatiNA

How can I preserve the empty values/missing nodes so that the result becomes a matrix or data frame like this:

    UniqueID    GrantTitle  GrantAmount Recipient
1   NL-KVK-41160054-100530  Improvement of basic health care    NA  CORDAID RCA
2   NL-KVK-41160054-100625  Pigs for Pencils    12500   PREDA Foundation Inc.
3   NL-KVK-41160054-100815  Job and housing opportunities for women     227000  WISE

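Since I'm still a novice, the simpler the code the better. Thanks in advance!

Accepted answer

If your XPath queries return too many or too few results, I think it is easier to work with the individual nodes: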

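library(XML)

doc   <- xmlParse('<your xml here>')
nodes <- getNodeSet(doc, "//iati-activity")

# Compare: querying the whole document returns one flat vector,
# so you cannot tell which activity each value belongs to
xpathSApply(doc, "//budget/value", xmlValue)
xpathSApply(doc, "//participating-org[@role='Funding']", xmlValue)

# Querying each activity node separately returns one result per activity,
# which may be empty or hold several values
sapply(nodes, function(x) xpathSApply(x, "./budget/value", xmlValue))
sapply(nodes, function(x) xpathSApply(x, "./participating-org[@role='Funding']", xmlValue))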
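
To make the comparison concrete, here is a small self-contained sketch. The inline XML is a made-up stand-in for the real file (three activities, the first without a budget value), and `doc_level` and `per_node` are names chosen for illustration. It assumes the XML package is installed.

```r
library(XML)

# Hypothetical miniature document: the first activity has no <value>
xml_text <- '<iati-activities>
  <iati-activity><budget/></iati-activity>
  <iati-activity><budget><value>12500</value></budget></iati-activity>
  <iati-activity><budget><value>227000</value></budget></iati-activity>
</iati-activities>'

doc   <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//iati-activity")

# Document-level query: one flat vector, the gap is invisible
doc_level <- xpathSApply(doc, "//budget/value", xmlValue)

# Per-node query: one list entry per activity, empty where nothing matched
per_node <- lapply(nodes, function(x) xpathSApply(x, "./budget/value", xmlValue))

length(doc_level)  # 2 values for 3 activities
lengths(per_node)  # 0 1 1, so the missing value stays in position
```

The per-node result keeps one slot per activity, which is what lets the empty slot be replaced with NA later.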

Add a function to handle missing or multiple nodes, then create the data.frame:

xpath2 <-function(x, path, fun = xmlValue, ...){
   y <- xpathSApply(x, path, fun, ...)
   ifelse(length(y) == 0, NA,
    ifelse(length(y) > 1, paste(unlist(y), collapse=", "), y))
}

GrantAmount <- sapply(nodes, xpath2, "./budget/value")
UniqueID    <- sapply(nodes, xpath2, "./iati-identifier")
GrantTitle  <- sapply(nodes, xpath2, "./title")
Recipient   <-  sapply(nodes, xpath2, "./participating-org[@role='Implementing']")
## updated xpath2 so xmlGetAttr will also work
Funding_ref  <- sapply(nodes, xpath2, "./participating-org[@role='Funding']", xmlGetAttr, "ref")
Budget_start <- sapply(nodes, xpath2, ".//period-start", xmlGetAttr, "iso-date")

data.frame(UniqueID, GrantTitle, GrantAmount, Recipient)
                UniqueID                               GrantTitle GrantAmount             Recipient
1 NL-KVK-41160054-100530         Improvement of basic health care        <NA>           CORDAID RCA
2 NL-KVK-41160054-100625                         Pigs for Pencils       12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women       227000                  WISE
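
The question used `[1]` to keep only the first match, whereas `xpath2` above pastes multiple matches together. As a variation, the sketch below keeps only the first match and also converts the amount to a number. `xpath_first` is a hypothetical helper, not part of the XML package, and the inline XML is invented sample data.

```r
library(XML)

# Hypothetical sample: activity A-1 has no budget value and two implementers
xml_text <- '<iati-activities>
  <iati-activity>
    <iati-identifier>A-1</iati-identifier>
    <budget/>
    <participating-org role="Implementing">Org One</participating-org>
    <participating-org role="Implementing">Org Two</participating-org>
  </iati-activity>
  <iati-activity>
    <iati-identifier>A-2</iati-identifier>
    <budget><value>12500</value></budget>
    <participating-org role="Implementing">Org Three</participating-org>
  </iati-activity>
</iati-activities>'

doc   <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//iati-activity")

# Hypothetical helper: NA when nothing matches, otherwise the first match only
xpath_first <- function(node, path, fun = xmlValue, ...) {
  y <- xpathSApply(node, path, fun, ...)
  if (length(y) == 0) NA_character_ else as.character(y[[1]])
}

df <- data.frame(
  UniqueID    = sapply(nodes, xpath_first, "./iati-identifier"),
  # as.numeric keeps NA as NA, so missing amounts stay missing
  GrantAmount = as.numeric(sapply(nodes, xpath_first, "./budget/value")),
  Recipient   = sapply(nodes, xpath_first, "./participating-org[@role='Implementing']"),
  stringsAsFactors = FALSE
)
df
```

Using a plain `if`/`else` rather than `ifelse` is a stylistic choice here, since the test is a single length check rather than a vector of conditions.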

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/31999486/
