r - Counting a group-based categorical variable over time (date column)

Suppose I have the following data:

date        name  rolename
2009-12-01  John  helper
2010-12-01  John  helper
2011-12-01  John  senior helper
2012-12-01  John  manager
2009-12-01  Will  helper
2010-12-01  Will  senior helper
2011-12-01  Will  manager
2012-12-01  Will  senior manager
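For reproducibility, the table above can be built as a data frame. This is a minimal sketch, assuming the dates are stored as character strings (as in the answer's printed output) and borrowing the name `quux` from the answer; the attempts further down refer to the same object as `data`.

```r
# Sample data from the table above. The name `quux` is taken from the
# answer; character dates are an assumption based on its <chr> column.
quux <- data.frame(
  date = rep(c("2009-12-01", "2010-12-01", "2011-12-01", "2012-12-01"), times = 2),
  name = rep(c("John", "Will"), each = 4),
  rolename = c("helper", "helper", "senior helper", "manager",
               "helper", "senior helper", "manager", "senior manager"),
  stringsAsFactors = FALSE
)

# Inspect the structure: 8 rows, 3 character columns.
str(quux)
```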

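I am trying to count, based on the rolename column, how many distinct roles each person in the name column has held so far. For example, for the data above I want a fourth column that tracks the number of positions a person has held up to that date: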

date        name  rolename        nopositions
2009-12-01  John  helper          1
2010-12-01  John  helper          1
2011-12-01  John  senior helper   2
2012-12-01  John  manager         3
2009-12-01  Will  helper          1
2010-12-01  Will  senior helper   2
2011-12-01  Will  manager         3
2012-12-01  Will  senior manager  4

My failed attempts:

# attempt 1
library(dplyr)

data %>%
  group_by(name) %>%
  mutate(nopositions = count(rolename))

# attempt 2
library(runner)

data %>%
  group_by(name) %>%
  mutate(nopositions = runner(x = rolename,
                              k = Inf,
                              idx = date,
                              f = function(x) length(x)))

Best answer

Assuming the rows are already in date order within each name (quux here is the data frame from the question),

library(dplyr)
quux %>%
  group_by(name) %>%
  mutate(noposition = cummax(match(rolename, unique(rolename)))) %>%
  ungroup()
# # A tibble: 8 × 4
#   date       name  rolename       noposition
#   <chr>      <chr> <chr>               <int>
# 1 2009-12-01 John  helper                  1
# 2 2010-12-01 John  helper                  1
# 3 2011-12-01 John  senior helper           2
# 4 2012-12-01 John  manager                 3
# 5 2009-12-01 Will  helper                  1
# 6 2010-12-01 Will  senior helper           2
# 7 2011-12-01 Will  manager                 3
# 8 2012-12-01 Will  senior manager          4
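The date-order assumption can be made explicit by sorting before counting. The following is only a sketch on a hypothetical, shuffled history for one person (the `shuffled` data frame is not part of the question), using base R rather than dplyr:

```r
# Hypothetical rows for one person, deliberately out of date order.
shuffled <- data.frame(
  date = c("2011-12-01", "2009-12-01", "2010-12-01"),
  rolename = c("senior helper", "helper", "helper"),
  stringsAsFactors = FALSE
)

# Sort chronologically first; as.Date() parses the ISO-formatted strings.
sorted <- shuffled[order(as.Date(shuffled$date)), ]

# Then apply the same expression as in the answer.
sorted$noposition <- cummax(match(sorted$rolename, unique(sorted$rolename)))
sorted$noposition
# [1] 1 1 2
```

Without the sort, "senior helper" would be seen first and receive position 1, which is not what the chronological count should report.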

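We could get away without cummax, except that if a person returns to a previous rolename, their noposition would decrease (revert to the earlier value). We want to keep the running maximum instead.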
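To see the effect of cummax in isolation, here is a small sketch on a hypothetical role history (not in the question's data) where the person goes back to "helper" in the third row:

```r
# Hypothetical role history: the person returns to "helper" in row 3.
rolename <- c("helper", "senior helper", "helper", "manager")

# Index of each role among the roles in order of first appearance.
first_seen <- match(rolename, unique(rolename))
first_seen
# [1] 1 2 1 3

# The running maximum keeps the count from dropping back to 1.
noposition <- cummax(first_seen)
noposition
# [1] 1 2 2 3
```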

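This does assume that unique preserves the natural order of first occurrence. If there is something amiss with that (I cannot think of anything), we can use an expanding window instead: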

quux %>%
  group_by(name) %>%
  mutate(noposition = sapply(seq_along(rolename), \(i) length(unique(rolename[1:i])))) %>%
  ungroup()
# # A tibble: 8 × 4
#   date       name  rolename       noposition
#   <chr>      <chr> <chr>               <int>
# 1 2009-12-01 John  helper                  1
# 2 2010-12-01 John  helper                  1
# 3 2011-12-01 John  senior helper           2
# 4 2012-12-01 John  manager                 3
# 5 2009-12-01 Will  helper                  1
# 6 2010-12-01 Will  senior helper           2
# 7 2011-12-01 Will  manager                 3
# 8 2012-12-01 Will  senior manager          4

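This produces the same result here, but it will tend to perform worse on larger groups (because it iterates more). I offer it as an extension in case some assumption rules out using cummax(match(..)).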
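As a quick check that the two approaches agree, the sketch below runs both on a hypothetical role history with repeated roles. It also compares them with a third idiom, cumsum(!duplicated(..)), which is not part of the answer but counts the same thing by adding one each time a role appears for the first time.

```r
# Hypothetical role history with repeated roles.
rolename <- c("helper", "senior helper", "helper", "manager", "senior helper")

# Approach 1 from the answer: running maximum of the first-appearance index.
by_cummax <- cummax(match(rolename, unique(rolename)))

# Approach 2 from the answer: distinct count over an expanding window.
by_window <- sapply(seq_along(rolename),
                    function(i) length(unique(rolename[1:i])))

# Not from the answer: add one whenever a role is seen for the first time.
by_duplicated <- cumsum(!duplicated(rolename))

by_cummax
# [1] 1 2 2 3 3

# All three agree on this input; stopifnot() is silent when that holds.
stopifnot(identical(by_cummax, by_window),
          identical(by_cummax, by_duplicated))
```

Inside the grouped dplyr pipeline, any of the three expressions could be used on the right-hand side of mutate(); which one is fastest on real data would need to be measured.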

Regarding "r - Counting a group-based categorical variable over time (date column)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74858383/
