R - Construct new variables from sequenced data

Tags: r dataframe data.table sequence

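This is an update/follow-up to this question. The answers given there do not meet the new requirements.

I am looking for an efficient way (data.table?) to construct two new measures for each ID.

Measure 1 and measure 2 need to satisfy the following conditions: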

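Condition 1: find a sequence of three rows in which:

  • the first row has count > 0,
  • the second row has count > 1, and
  • the third row has count == 1.

Condition 2 for measure 1:

  • take the elements of product in the third row of the sequence that are:
  • in the product of the second row of the sequence, and
  • not in the stock of the first row of the sequence.

Condition 2 for measure 2:

  • take the elements of product in the last (third) row of the sequence that are:
  • not in the product of the second row of the sequence, and
  • not in the stock of the first row of the sequence.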
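
To make the two conditions concrete, here is a minimal sketch in base R for the one matching sequence of ID 1 (the rows with seqs 2, 3 and 4 in the data below). It assumes the comma-separated strings have already been split into character vectors; the variable names are hypothetical and only serve the illustration.

```r
# Items of the three rows of one matching sequence (ID 1, seqs 2 to 4).
# All names here are hypothetical, chosen for this illustration only.
stock_row1   <- c("A", "B")        # stock of the first row
product_row2 <- c("C")             # product of the second row
product_row3 <- c("A", "C", "E")   # product of the third row

# Measure 1: in product of row 3, in product of row 2, not in stock of row 1
measure1 <- setdiff(intersect(product_row3, product_row2), stock_row1)

# Measure 2: in product of row 3, not in product of row 2, not in stock of row 1
measure2 <- setdiff(product_row3, union(product_row2, stock_row1))

print(measure1)  # "C"
print(measure2)  # "E"
```

Both values agree with the first row of the desired output.
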

Data:

df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
              seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
              count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
              product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
              stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

> df2
   ID seqs count product     stock
1   1    1     2       A         A
2   1    2     1       B       A,B
3   1    3     3       C     A,B,C
4   1    4     1   A,C,E   A,B,C,E
5   1    5     1     A,B   A,B,C,E
6   1    6     2   A,B,C   A,B,C,E
7   1    7     3       D A,B,C,D,E
8   2    1     1       A         A
9   2    2     2       B       A,B
10  2    3     1       A       A,B
11  3    1     3       A         A
12  3    2     1   A,B,C     A,B,C
13  3    3     4       D   A,B,C,D
14  3    4     1       D   A,B,C,D

The desired output looks like this:

   ID seq1 seq2 seq3 measure1 measure2
1:  1    2    3    4        C        E
2:  2    1    2    3
3:  3    2    3    4        D

How would you code this?

Best answer

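To do this, you need to know about the following:

  • the shift function, to compare values within a group
  • the separate_rows function, to split your strings and get a normalized view of the data.
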
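As a small illustration of the first point, the sketch below applies data.table's shift with type = "lead" inside each ID. The toy table and the column name next_count are assumptions made up for this example, not part of the question's data.

```r
library(data.table)

# Hypothetical toy data: two IDs with three and two rows
toy <- data.table(ID = c(1, 1, 1, 2, 2), count = c(2, 1, 3, 1, 2))

# For each row, fetch the count of the following row within the same ID.
# The last row of each ID has no successor, so it gets NA.
toy[, next_count := shift(count, n = 1, type = "lead"), by = ID]

print(toy$next_count)  # 1 3 NA 2 NA
```

Grouping with by = ID is what keeps a value from one ID from leaking into the rows of the previous ID.
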
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

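# Look ahead one and two rows within each ID, so that a sequence
# never spans two different IDs
dt[, count.2 := shift(count, type = "lead"), by = ID]
dt[, count.3 := shift(count, n = 2, type = "lead"), by = ID]

dt[, product.2 := shift(product, type = "lead"), by = ID]
dt[, product.3 := shift(product, n = 2, type = "lead"), by = ID]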


dt <- dt[count > 0 & count.2 > 1 &  count.3 == 1]
dt <- unique(dt, by = "ID")

library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")

dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[, 
  .(
    measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE), 
    measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
  ),
  ID
]

dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1:  1         C         E
# 2:  2      <NA>      <NA>
# 3:  3         D      <NA>
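
Note that after separate_rows, a row-wise test such as product.3 != stock compares one pair of items at a time, which is not the same as asking whether an item is absent from the whole stock set, and product.2 is compared as an unsplit string. On the sample data the max() step still yields the expected result, but that may not hold in general. As a hedged alternative, the following sketch expresses the two conditions with base R set operations and also returns seq1 to seq3. The helpers split_items and join_items and the new column names are hypothetical, and only the first matching sequence per ID is kept, as in the code above.

```r
library(data.table)

dt <- data.table(
  ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
  product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
  stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D")
)

# Hypothetical helpers: split "A,B" into c("A", "B") and join it back;
# an empty set becomes NA
split_items <- function(x) strsplit(x, ",", fixed = TRUE)[[1]]
join_items <- function(x) if (length(x) > 0) paste(x, collapse = ",") else NA_character_

# Values of the next two rows, computed within each ID
dt[, c("seq2", "seq3") := shift(seqs, n = 1:2, type = "lead"), by = ID]
dt[, c("count2", "count3") := shift(count, n = 1:2, type = "lead"), by = ID]
dt[, c("product2", "product3") := shift(product, n = 1:2, type = "lead"), by = ID]

# Condition 1, keeping only the first matching sequence per ID
hits <- unique(dt[count > 0 & count2 > 1 & count3 == 1], by = "ID")

# Measure 1: in product of row 3, in product of row 2, not in stock of row 1
hits[, measure1 := mapply(
  function(p3, p2, s1) {
    join_items(setdiff(intersect(split_items(p3), split_items(p2)), split_items(s1)))
  },
  product3, product2, stock, USE.NAMES = FALSE
)]

# Measure 2: in product of row 3, not in product of row 2, not in stock of row 1
hits[, measure2 := mapply(
  function(p3, p2, s1) {
    join_items(setdiff(split_items(p3), union(split_items(p2), split_items(s1))))
  },
  product3, product2, stock, USE.NAMES = FALSE
)]

res <- hits[, .(ID, seq1 = seqs, seq2, seq3, measure1, measure2)]
print(res)
```

On the sample data this gives C and E for ID 1, NA and NA for ID 2, and D and NA for ID 3, with the seq1 to seq3 values of the desired output.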

Regarding "R - Construct new variables from sequenced data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57529311/
