R Dataframe - Apply an expression across a time series and output the results to a new data frame

Tags: r dataframe

I am learning R and have run into a problem I can't get past or find an answer to.

I have a data frame:

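# ID and Value have 24 entries and Date has 48, so data.frame() recycles ID and Value
df <- data.frame(
  ID=c("a1","a1","a1","a1", 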
       "a2","a2","a2","a2",
       "a3","a3","a3","a3",
       "b1","b1","b1","b1", 
       "b2","b2","b2","b2",
       "b3","b3","b3","b3"), 
  Date=c("January-19", "February-19", "March-19", "April-19", 
         "January-19", "February-19", "March-19", "April-19",
         "January-19", "February-19", "March-19", "April-19", 
         "January-19", "February-19", "March-19", "April-19", 
         "January-19", "February-19", "March-19", "April-19", 
         "January-19", "February-19", "March-19", "April-19", 
         "May-19", "June-19", "July-19", "August-19", 
         "May-19", "June-19", "July-19", "August-19",
         "May-19", "June-19", "July-19", "August-19", 
         "May-19", "June-19", "July-19", "August-19",
         "May-19", "June-19", "July-19", "August-19", 
         "May-19", "June-19", "July-19", "August-19"), 
  Value=c(1,2,5,4,7,3,9,8,9,10,44,3,15,16,17,2, 3, 22, 12, 3, 4, 44, 24, 5))

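The "ID" column is character, the "Date" column is date, and the "Value" column is numeric.

Based on this data frame (df), I am trying to create a new data frame that shows the result of an expression in one column and the date it refers to in another column.

For example, for a given date in "df", I want to compute the "Value" of a given expression such as "(a1 + b1)/b1" and put the result in a new data frame that shows the date each single value refers to, applied across the "Date" time series.

Using the "df" values and the example expression, the new data frame would look like this: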

January-19  | 1.06
February-19 | 1.13
March-19    | 1.29 
April-19    | 3
May-19      | 1.06
June-19     | 1.13
July-19     | 1.29

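The expressions are far more complicated than the example given, but I'm not sure that matters, since what I'm trying to work out is how to apply any calculation and output it against a new series of dates in a data frame, regardless of complexity.

Apologies if this is a simple question, and thank you in advance.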

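Best answer

Here is a base R solution that works for all ID sets. It assumes the entries are symmetric, that is, every "a" ID has a matching "b" ID for each date.

The important step is getting the data into the right order. The later steps only process the entries.

The benefits of this approach are execution time that scales well, maximum control over the data, and independence from packages (a personal preference).

Data: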

df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L), class = "factor", .Label = c("a1",
"a2", "a3", "b1", "b2", "b3")), Date = structure(c(4L, 3L, 7L,
1L, 4L, 3L, 7L, 1L, 4L, 3L, 7L, 1L, 4L, 3L, 7L, 1L, 4L, 3L, 7L,
1L, 4L, 3L, 7L, 1L, 8L, 6L, 5L, 2L, 8L, 6L, 5L, 2L, 8L, 6L, 5L,
2L, 8L, 6L, 5L, 2L, 8L, 6L, 5L, 2L, 8L, 6L, 5L, 2L), .Label = c("April-19",
"August-19", "February-19", "January-19", "July-19", "June-19",
"March-19", "May-19"), class = "factor"), Value = c(1, 2, 5,
4, 7, 3, 9, 8, 9, 10, 44, 3, 15, 16, 17, 2, 3, 22, 12, 3, 4,
44, 24, 5, 1, 2, 5, 4, 7, 3, 9, 8, 9, 10, 44, 3, 15, 16, 17,
2, 3, 22, 12, 3, 4, 44, 24, 5)), class = "data.frame", row.names = c(NA,
-48L))

First, reorder the data frame:

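# Sort by the digit in each ID ("a1" -> "1"), then by date,
# so that rows alternate a1, b1, a1, b1, ... within each ID set
df_reo <- df[ order( matrix( unlist( strsplit( as.character(df$ID), "" ) ),
                             ncol=2, byrow=TRUE )[,2],
                     as.Date(df$Date, "%b-%d") ), ]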
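
The `as.Date(df$Date, "%b-%d")` call above reads the "19" as a day of the month, which is enough for ordering months within a single year. If the suffix is actually a two-digit year (an assumption, since the question does not say), a locale-independent sketch could look like the following. The helper name `parse_month_year` is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): turn "January-19"
# into a Date on the first day of that month.
# Assumptions: English month names, and a two-digit year in the 2000s.
parse_month_year <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  mon   <- match(vapply(parts, `[`, character(1), 1L), month.name)
  yr    <- 2000L + as.integer(vapply(parts, `[`, character(1), 2L))
  as.Date(sprintf("%04d-%02d-01", yr, mon))
}

dates <- c("March-19", "January-19", "August-19", "February-19")

# Order the labels chronologically using the parsed dates as the sort key.
sorted_dates <- dates[order(parse_month_year(dates))]
print(sorted_dates)
```

Using `month.name` instead of `%B` avoids depending on the session locale, and the same key could be passed to `order()` in place of the `as.Date()` call when the data spans more than one year.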

Set up helper variables:

li <- matrix( 1:nrow(df_reo), ncol=2, byrow=TRUE ) # helper row indices: one (a, b) pair per row
colnames(li) <- c("a","b")

# number of ID sets (3 here), read from the digit of the last sorted ID; only used for nicer formatting
ds <- as.numeric( unlist(strsplit(sort(as.character( df$ID )), "" )[nrow(df)])[2] )
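
To make the role of the helper matrix `li` concrete, here is a minimal sketch on a toy vector. The names `vals`, `li_demo` and `ratio` are made up for illustration, and the values are the first three a1/b1 pairs from the data, in the order the reordering step produces.

```r
# Toy vector in reordered form: a1, b1, a1, b1, ... (January, February, March)
vals <- c(1, 15, 2, 16, 5, 17)

# Each row of li_demo holds the index of an "a" entry and its matching "b" entry.
li_demo <- matrix(seq_along(vals), ncol = 2, byrow = TRUE)
colnames(li_demo) <- c("a", "b")

# Indexing with whole columns applies (a + b) / b to all pairs at once.
ratio <- (vals[li_demo[, "a"]] + vals[li_demo[, "b"]]) / vals[li_demo[, "b"]]
print(ratio)
```

Because the columns of the index matrix can be used directly as vectors of row positions, the same expression also works without an explicit loop.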

Then do the calculation:

df_fin <- matrix( vapply( 1:nrow(li), function(x){
                        ( df_reo$Value[li[x,"a"]] + df_reo$Value[li[x,"b"]] ) / 
                          df_reo$Value[li[x,"b"]] }, 1.0 ), ncol=ds ) 

rownames(df_fin) <- unique(df_reo$Date)
> data.frame( df_fin )
                  X1       X2       X3
January-19  1.066667 3.333333 3.250000
February-19 1.125000 1.136364 1.227273
March-19    1.294118 1.750000 2.833333
April-19    3.000000 3.666667 1.600000
May-19      1.066667 3.333333 3.250000
June-19     1.125000 1.136364 1.227273
July-19     1.294118 1.750000 2.833333
August-19   3.000000 3.666667 1.600000
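
Since the question asks about arbitrary expressions, one alternative is to reshape the data so that each ID becomes a column and then evaluate the expression against that table. The sketch below is not part of the original answer: `df_small` is a made-up stand-in holding only the a1 and b1 values for the first four months, and it assumes each (ID, Date) pair occurs exactly once.

```r
# Stand-in for the question's df: a1 and b1 only, January to April.
df_small <- data.frame(
  ID    = rep(c("a1", "b1"), each = 4),
  Date  = rep(c("January-19", "February-19", "March-19", "April-19"), times = 2),
  Value = c(1, 2, 5, 4, 15, 16, 17, 2),
  stringsAsFactors = FALSE
)

# One row per date, one column per ID (columns come out as Value.a1, Value.b1).
wide <- reshape(df_small, idvar = "Date", timevar = "ID", direction = "wide")
names(wide) <- sub("^Value\\.", "", names(wide))

# Any expression written in terms of the ID names is evaluated row by row.
expr   <- quote((a1 + b1) / b1)
result <- data.frame(Date = wide$Date, Result = eval(expr, wide))
print(result)
```

The `Result` column should agree with the first four rows of `X1` above. Swapping in a different expression only requires changing `expr`, at the cost of the fine-grained control over row order that the index-based approach gives.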

For more on applying an expression across a time series in an R data frame and outputting the results to a new data frame, see the similar question on Stack Overflow: https://stackoverflow.com/questions/66286961/
