I have the following data:
data <- structure(list(user = c(1234L, 1234L, 1234L, 1234L, 1234L, 1234L,
1234L, 1234L, 1234L, 1234L, 1234L, 4758L, 4758L, 9584L, 9584L,
9584L, 9584L, 9584L, 9584L), time = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), fruit = structure(c(1L,
6L, 1L, 1L, 6L, 5L, 5L, 3L, 4L, 1L, 2L, 4L, 2L, 1L, 6L, 5L, 5L,
3L, 2L), .Label = c("apple", "banana", "lemon", "lime", "orange",
"pear"), class = "factor"), count = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), cum_sum = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 1L, 2L, 1L, 2L, 3L,
4L, 5L, 6L)), .Names = c("user", "time", "fruit", "count", "cum_sum"
), row.names = c(NA, -19L), class = "data.frame")
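For each user in this set, I want to see the sequence of fruits over time. However, some fruits are listed "back to back", i.e. the same fruit appears in consecutive rows.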
  user time  fruit count cum_sum
1 1234    1  apple     1       1
2 1234    2   pear     1       2
3 1234    3  apple     1       3
4 1234    4  apple     1       4
5 1234    5   pear     1       5
6 1234    6 orange     1       6
7 1234    7 orange     1       7
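What I am looking for is more of a time series of unique fruits per user.
The problem is that if I group by user and fruit and then summarise, dplyr automatically sorts the fruits alphabetically: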
library(dplyr)
data %>%
  group_by(user, fruit) %>%
  summarise(temp_var = 1) %>%
  mutate(cum_sum = cumsum(temp_var))
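What I really want, for user 1234 above (for example), is the fruits listed in time order but with consecutive duplicates removed. So where we currently see apple > pear > apple > apple > pear > orange > orange, we would only see apple > pear > apple > pear > orange.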
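Best answer
Using the rleid function from the latest data.table version on CRAN, we can simply do the following (though I am not sure about your exact desired output):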
library(data.table) ## v >= 1.9.6
res <- setDT(data)[, .(fruit = fruit[1L]), by = .(user, indx = rleid(fruit))
][, cum_sum := seq_len(.N), by = user
][, indx := NULL]
res
# user fruit cum_sum
# 1: 1234 apple 1
# 2: 1234 pear 2
# 3: 1234 apple 3
# 4: 1234 pear 4
# 5: 1234 orange 5
# 6: 1234 lemon 6
# 7: 1234 lime 7
# 8: 1234 apple 8
# 9: 1234 banana 9
# 10: 4758 lime 1
# 11: 4758 banana 2
# 12: 9584 apple 1
# 13: 9584 pear 2
# 14: 9584 orange 3
# 15: 9584 lemon 4
# 16: 9584 banana 5
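Since the question asks about dplyr, the same idea (keep a row only when its fruit differs from the previous row for that user) can probably be sketched with lag() as well. This is a minimal sketch, not part of the original answer; the small data frame df below is a hypothetical sample that only mirrors the user, time and fruit columns of the question, and fruit is stored as character to avoid relying on factor comparison.

```r
library(dplyr)

# Hypothetical sample with the same column names as the question's data
df <- data.frame(
  user  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  time  = c(1L, 2L, 3L, 4L, 5L, 1L, 2L),
  fruit = c("apple", "pear", "apple", "apple", "pear", "pear", "pear"),
  stringsAsFactors = FALSE
)

res <- df %>%
  arrange(user, time) %>%                 # make sure rows are in time order per user
  group_by(user) %>%
  # keep the first row of each user, and any row whose fruit differs
  # from the fruit in the previous row of the same user
  filter(row_number() == 1L | fruit != lag(fruit)) %>%
  mutate(cum_sum = row_number()) %>%      # position within the de-duplicated sequence
  ungroup() %>%
  select(user, fruit, cum_sum)

res
```

With the question's data, where fruit is a factor, wrapping it as as.character(fruit) inside the filter() call should give the same behaviour. Because the rows are filtered rather than summarised by fruit, the original time order is kept and no alphabetical sorting takes place.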
This question, "r - dplyr + group_by and avoiding alphabetical sorting", comes from Stack Overflow: https://stackoverflow.com/questions/31056524/