r - Difference by group using dplyr and data.table

Tags: r data.table dplyr

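I want to calculate differences by group. Although I consulted the SO thread R: Function “diff” over various groups, for some reason I cannot get the difference. I tried three methods: a) spread, b) dplyr::mutate with base::diff(), and c) data.table with base::diff(). While a) works, I am not sure how to solve this with b) and c).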

Data description: I have revenue data for each product by year. I classified years >= 2013 as period 2 (called P2) and years < 2013 as period 1 (called P1).
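The classification rule above can be sketched in base R as follows. The vector `ship_date` is a made-up illustration, not the question's data; only the cutoff of 2013 comes from the question.

```r
# Hypothetical years for illustration; the cutoff is the one from the question.
ship_date <- c(2010, 2012, 2016, 2017)
cutoff <- 2013

# Years at or after the cutoff fall into P2, earlier years into P1.
ship_period <- ifelse(ship_date >= cutoff, "P2", "P1")
print(ship_period)
```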

Sample data:

dput(Test_File)
structure(list(Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012, 
2017, 2017, 2017, 2016, 2016, 2016, 2011, 2017), Name = c("Apple", 
"Apple", "Banana", "Banana", "Banana", "Banana", "Apple", "Apple", 
"Apple", "Banana", "Banana", "Banana", "Mango", "Pineapple"), 
    Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5, 
    7)), .Names = c("Ship_Date", "Name", "Revenue"), row.names = c(NA, 
14L), class = "data.frame")

Expected output:

dput(Diff_Table)
structure(list(Name = c("Apple", "Banana", "Mango", "Pineapple"
), P1 = c(15, 58, 5, NA), P2 = c(75, 7, NA, 7), Diff = c(60, 
-51, NA, NA)), .Names = c("Name", "P1", "P2", "Diff"), class = "data.frame", row.names = c(NA, 
-4L))

Here is my code:

Method 1: using spread [works]

data.table::setDT(Test_File)
cutoff<-2013
Test_File[Test_File$Ship_Date>=cutoff,"Ship_Period"]<-"P2"
Test_File[Test_File$Ship_Date<cutoff,"Ship_Period"]<-"P1"

Diff_Table<- Test_File %>%
  dplyr::group_by(Ship_Period,Name) %>%
  dplyr::mutate(Revenue = sum(Revenue)) %>%
  dplyr::select(Ship_Period, Name,Revenue) %>%
  dplyr::ungroup() %>%
  dplyr::distinct() %>%
  tidyr::spread(key = Ship_Period,value = Revenue) %>% 
  dplyr::mutate(Diff = `P2` - `P1`)

Method 2: using dplyr [does not work: NAs are generated in the Diff column.]

Diff_Table<- Test_File %>%
  dplyr::group_by(Ship_Period,Name) %>%
  dplyr::mutate(Revenue = sum(Revenue)) %>%
  dplyr::select(Ship_Period, Name,Revenue) %>%
  dplyr::ungroup() %>%
  dplyr::distinct() %>%
  dplyr::arrange(Name,Ship_Period, Revenue) %>%
  dplyr::group_by(Ship_Period,Name) %>%
  dplyr::mutate(Diff = diff(Revenue))

Method 3: using data.table [does not work: it generates all zeros in the Diff column.]

Test_File[,Revenue1 := sum(Revenue),by=c("Ship_Period","Name")]
Diff_Table<-Test_File[,.(Diff = diff(Revenue1)),by=c("Ship_Period","Name")]

Question: Can someone help me fix Method 2 and Method 3 above? I am new to R, so I apologize if my work sounds too basic. I am still learning the language.

Best answer

We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(Test_File)), group by 'Name' and the run-length id of 'Name', get the sum of 'Revenue', reshape to 'wide' format with dcast, take the difference between 'P2' and 'P1', and assign (:=) it to 'Diff'.

library(data.table)
dcast(setDT(Test_File)[, .(Revenue = sum(Revenue)),
   .(grp=rleid(Name), Name)], Name~ paste0("P", rowid(Name)), 
        value.var = "Revenue")[, Diff := P2 - P1][]
#        Name P1 P2 Diff
#1:     Apple 15 75   60
#2:    Banana 58  7  -51
#3:     Mango  5 NA   NA
#4: Pineapple  7 NA   NA
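Note that the run-length grouping labels periods by order of appearance, so Pineapple (shipped only in 2017) lands in P1 in the output above, whereas the question's expected output has it in P2. A minimal sketch that derives the period from the 2013 cutoff instead, assuming the sample data from the question (the names `dt`, `agg`, and `wide` are illustrative, not from the original answer):

```r
library(data.table)

# Sample data from the question, rebuilt compactly.
dt <- data.table(
  Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012,
                2017, 2017, 2017, 2016, 2016, 2016, 2011, 2017),
  Name = rep(c("Apple", "Banana", "Apple", "Banana", "Mango", "Pineapple"),
             times = c(2, 4, 3, 3, 1, 1)),
  Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5, 7)
)
cutoff <- 2013

# Label each row by period using the cutoff rather than row order.
dt[, Ship_Period := ifelse(Ship_Date >= cutoff, "P2", "P1")]

# One total per Name and period, then one column per period.
agg <- dt[, .(Revenue = sum(Revenue)), by = .(Name, Ship_Period)]
wide <- dcast(agg, Name ~ Ship_Period, value.var = "Revenue")

# Diff is NA wherever either period is missing for a name.
wide[, Diff := P2 - P1]
print(wide)
```

Because the period comes from `Ship_Date`, this version does not depend on the rows being ordered so that each name's P1 rows come before its P2 rows.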

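Or, for the third case, i.e. base R, we create a grouping column ('grp') based on whether adjacent elements in 'Name' are the same, then aggregate 'Revenue' by 'Name' and 'grp' to find the sum, create a sequence column, reshape to 'wide', and transform the dataset to create the 'Diff' column.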

Test_File$grp <- with(Test_File, cumsum(c(TRUE, Name[-1]!=Name[-length(Name)])))
d1 <- aggregate(Revenue~Name +grp, Test_File, sum)
d1$Seq <- with(d1, ave(seq_along(Name), Name, FUN = seq_along))
transform(reshape(d1[-2], idvar = "Name", timevar = "Seq", 
            direction = "wide"), Diff = Revenue.2- Revenue.1)
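To make the grouping expression in the base R code easier to follow, here is a small sketch of what `cumsum(c(TRUE, x[-1] != x[-length(x)]))` computes. The vector `name` is a made-up example, not the question's data.

```r
# Hypothetical vector with repeated runs of the same value.
name <- c("Apple", "Apple", "Banana", "Apple", "Apple", "Mango")

# Compare each element with its predecessor; TRUE marks the start of a new run.
# The first element always starts a run, hence the leading TRUE.
starts <- c(TRUE, name[-1] != name[-length(name)])

# The cumulative sum turns run starts into a run id.
grp <- cumsum(starts)
print(grp)
```

Each block of consecutive identical names gets its own id, so the second block of "Apple" is kept apart from the first. This is the base R counterpart of data.table's `rleid(Name)`.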

A tidyverse approach can also be used:

library(dplyr)
library(tidyr)
Test_File %>% 
       group_by(grp = cumsum(c(TRUE, Name[-1]!=Name[-length(Name)])), Name)  %>%
       summarise(Revenue = sum(Revenue)) %>%
       group_by(Name) %>% 
       mutate(Seq = paste0("P", row_number()))  %>% 
       select(-grp) %>% 
       spread(Seq, Revenue) %>% 
       mutate(Diff = P2-P1)
 #Source: local data frame [4 x 4]
 #Groups: Name [4]

#      Name    P1    P2  Diff
#      <chr> <dbl> <dbl> <dbl>
#1     Apple    15    75    60
#2    Banana    58     7   -51
#3     Mango     5    NA    NA
#4 Pineapple     7    NA    NA
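As a hedged variant of the tidyverse approach, the sketch below takes the period from the question's 2013 cutoff and uses `pivot_wider()`, which tidyr documents as the replacement for the superseded `spread()`. It assumes dplyr >= 1.0 (for the `.groups` argument) and tidyr >= 1.0; the name `tf` is illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt compactly.
tf <- data.frame(
  Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012,
                2017, 2017, 2017, 2016, 2016, 2016, 2011, 2017),
  Name = rep(c("Apple", "Banana", "Apple", "Banana", "Mango", "Pineapple"),
             times = c(2, 4, 3, 3, 1, 1)),
  Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5, 7)
)
cutoff <- 2013

Diff_Table <- tf %>%
  # Label each row by period using the cutoff.
  mutate(Ship_Period = if_else(Ship_Date >= cutoff, "P2", "P1")) %>%
  # Collapse to one total per Name and period.
  group_by(Name, Ship_Period) %>%
  summarise(Revenue = sum(Revenue), .groups = "drop") %>%
  # One column per period; missing combinations become NA.
  pivot_wider(names_from = Ship_Period, values_from = Revenue) %>%
  mutate(Diff = P2 - P1)

print(Diff_Table)
```

Using `summarise()` instead of `mutate()` followed by `distinct()` collapses each group to a single row directly.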

Update

Based on the OP's comment, using only the diff function:

library(data.table)
setDT(Test_File)[, .(Revenue = sum(Revenue)), .(Name, grp = rleid(Name))
  ][, .(P1 = Revenue[1L], P2 = Revenue[2L], Diff = diff(Revenue)) , Name]
#        Name P1 P2 Diff
#1:     Apple 15 75   60
#2:    Banana 58  7  -51
#3:     Mango  5 NA   NA
#4: Pineapple  7 NA   NA
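A likely reason the question's Method 2 and Method 3 fail is that `diff()` was applied within groups defined by both 'Ship_Period' and 'Name', where there is nothing left to compare. The base R sketch below illustrates this with values taken from the Apple rows; the variable names are illustrative.

```r
# Inside the group (P1, Apple), Revenue1 holds the group total on every row,
# so consecutive differences are all zero (as seen in Method 3).
same_group <- c(15, 15)
d_const <- diff(same_group)

# After distinct(), each (Ship_Period, Name) group has a single row,
# and diff() of a single value has length zero (nothing to subtract).
d_single <- diff(15)

# Grouping by Name alone puts the P1 and P2 totals side by side,
# which is what the code above does before calling diff().
d_name <- diff(c(15, 75))

print(d_const)
print(length(d_single))
print(d_name)
```

So `diff()` only yields the P2 minus P1 value when both period totals sit in the same group, ordered P1 first.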

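Or with dplyr:

Test_File %>% 
   group_by(grp = cumsum(c(TRUE, Name[-1]!=Name[-length(Name)])), Name)  %>%
   summarise(Revenue = sum(Revenue)) %>%
   group_by(Name) %>% 
   # nth(Revenue, 2) is NA when a name has only one run; last(Revenue) would
   # repeat P1 there and give Diff = 0 instead of NA
   summarise(P1 = first(Revenue), P2 = nth(Revenue, 2)) %>%
   mutate(Diff = P2-P1)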

Regarding r - Difference by group using dplyr and data.table, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43627015/
