Rolling count of distinct years

Tags: r dplyr

This is probably easy, but I haven't been able to figure it out.

Here is part of my dataset:

structure(list(Patent = c("4683202", "4683195", "4800159", "4965188", 
"4994368", "5328824", "4879214", "4921794", "4983728", "4994372"
), subclass = c("435/91.2", "435/91.2", "435/91.2", "435/91.2", 
"435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2"
), AppYear = c(1985L, 1986L, 1986L, 1987L, 1987L, 1987L, 1988L, 
1988L, 1990L, 1990L), app = 1:10), class = "data.frame", row.names = c(NA, 
-10L))


> data
# A tibble: 10 x 3
   Patent  subclass AppYear
   <chr>   <chr>      <int>
 1 4683202 435/91.2    1985
 2 4683195 435/91.2    1986
 3 4800159 435/91.2    1986
 4 4965188 435/91.2    1987
 5 4994368 435/91.2    1987
 6 5328824 435/91.2    1987
 7 4879214 435/91.2    1988
 8 4921794 435/91.2    1988
 9 4983728 435/91.2    1990
10 4994372 435/91.2    1990

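First, I need a rolling count of the distinct years, app. Second, I need a lag over the distinct years, lag(AppYear): when the previous row has the same year, the value should come from the previous distinct year instead.

Desired output: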
# A tibble: 10 x 5
   Patent  subclass AppYear   app `lag(AppYear)`
   <chr>   <chr>      <int> <int>          <int>
 1 4683202 435/91.2    1985     1             NA
 2 4683195 435/91.2    1986     2           1985
 3 4800159 435/91.2    1986     2           1985
 4 4965188 435/91.2    1987     3           1986
 5 4994368 435/91.2    1987     3           1986
 6 5328824 435/91.2    1987     3           1986
 7 4879214 435/91.2    1988     4           1987
 8 4921794 435/91.2    1988     4           1987
 9 4983728 435/91.2    1990     5           1988
10 4994372 435/91.2    1990     5           1988

Edit: The full dataset contains many subclasses, so I need to group by subclass first. The data is currently sorted and built like this:
data <- data %>% 
  select(Patent, subclass, AppYear) %>% 
  arrange(AppYear,Patent) %>% 
  group_by(subclass) %>% 
  mutate(app = 1:n(), lag(AppYear))

structure(list(Patent = c("4683202", "4683195", "4800159", "4965188", 
"4994368", "5328824", "4879214", "4921794", "4983728", "4994372", 
"5066584", "5075216", "5091310", "5093245", "5132215", "5185243", 
"5409818", "5409818", "6107023", "4994370", "5001050", "5023171", 
"5035996", "5035996", "5043272", "5045450", "5055393", "5085983", 
"5106729", "5106729"), subclass = c("435/91.2", "435/91.2", "435/91.2", 
"435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2", 
"435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2", 
"435/91.2", "435/91.21", "435/91.2", "435/91.2", "435/91.2", 
"435/91.2", "435/91.2", "435/91.2", "435/91.21", "435/91.2", 
"435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.21"), 
    AppYear = c(1985L, 1986L, 1986L, 1987L, 1987L, 1987L, 1988L, 
    1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
    1988L, 1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 
    1989L, 1989L, 1989L, 1989L, 1989L), app = c(1L, 2L, 3L, 4L, 
    5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 1L, 
    17L, 18L, 19L, 20L, 21L, 22L, 2L, 23L, 24L, 25L, 26L, 27L, 
    3L), `lag(AppYear)` = c(NA, 1985L, 1986L, 1986L, 1987L, 1987L, 
    1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
    1988L, NA, 1988L, 1988L, 1988L, 1989L, 1989L, 1989L, 1988L, 
    1989L, 1989L, 1989L, 1989L, 1989L, 1989L)), class = "data.frame", row.names = c(NA, 
-30L), .Names = c("Patent", "subclass", "AppYear", "app", "lag(AppYear)"
))

I tried several ways to get app, for example cumsum(1:length(AppYear)), but none of them gave the result I need.
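For reference, here is a minimal sketch of what that attempt computes on the AppYear values of the 10-row sample, next to one possible way to get a running count of distinct values in base R. The variable names attempt and running_distinct are made up for illustration, and the second expression assumes the rows are already sorted by year.

```r
# AppYear values from the 10-row sample, already sorted by year
AppYear <- c(1985L, 1986L, 1986L, 1987L, 1987L, 1987L, 1988L, 1988L, 1990L, 1990L)

# The attempt: a running sum of the row positions 1..n,
# which does not depend on the year values at all
attempt <- cumsum(1:length(AppYear))
attempt
#> [1]  1  3  6 10 15 21 28 36 45 55

# One alternative (an assumption, not part of the question):
# add 1 each time a year shows up for the first time
running_distinct <- cumsum(!duplicated(AppYear))
running_distinct
#> [1] 1 2 2 3 3 3 4 4 5 5
```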

Best answer

Update:
Addressing the follow-up question about a df with multiple subclass groups.

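library(dplyr)

df1 %>% 
  select(Patent, subclass, AppYear) %>% 
  arrange(AppYear, Patent) %>%
  group_by(subclass) %>% 
  # apply the per-year logic separately to each subclass
  group_map(~mutate(., app = group_indices(., AppYear),  # id of each distinct year
                    # previous distinct year, repeated once per row of the current year
                    lag_year = rep(lag(unique(.$AppYear)), count_(., "AppYear")$n)), 
            keep = TRUE) %>%   # keep the subclass column in each piece
  bind_rows() %>% 
  arrange(AppYear, Patent) 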

#> # A tibble: 30 x 5
#>    Patent  subclass AppYear   app lag_year
#>    <chr>   <chr>      <int> <int>    <int>
#>  1 4683202 435/91.2    1985     1       NA
#>  2 4683195 435/91.2    1986     2     1985
#>  3 4800159 435/91.2    1986     2     1985
#>  4 4965188 435/91.2    1987     3     1986
#>  5 4994368 435/91.2    1987     3     1986
#>  6 5328824 435/91.2    1987     3     1986
#>  7 4879214 435/91.2    1988     4     1987
#>  8 4921794 435/91.2    1988     4     1987
#>  9 4983728 435/91.2    1988     4     1987
#> 10 4994372 435/91.2    1988     4     1987
#> # ... with 20 more rows
Note that I am using the data the OP provided under the "Edit" section of the question.
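As a possible alternative to the group_map() approach, the same two columns can probably be built with a plain grouped mutate(). This is only a sketch, not the method used above: it swaps in dplyr::dense_rank() for the year index and looks up the previous distinct year by position. The data frame toy is a six-row subset of the OP's "Edit" data chosen for illustration, and the names toy and res are made up.

```r
library(dplyr)

# Six rows taken from the OP's "Edit" data, three per subclass
toy <- data.frame(
  Patent   = c("4683202", "4683195", "4800159", "5409818", "5035996", "5106729"),
  subclass = c("435/91.2", "435/91.2", "435/91.2", "435/91.21", "435/91.21", "435/91.21"),
  AppYear  = c(1985L, 1986L, 1986L, 1988L, 1989L, 1989L),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(subclass) %>%
  mutate(
    # 1 for the earliest year in the subclass, 2 for the next distinct year, ...
    app = dense_rank(AppYear),
    # prepend NA to the sorted distinct years so that position `app`
    # holds the previous distinct year (NA for the first one)
    lag_year = c(NA, sort(unique(AppYear)))[app]
  ) %>%
  ungroup()

res$app
#> [1] 1 2 2 1 2 2
res$lag_year
#> [1]   NA 1985 1985   NA 1988 1988
```

Because dense_rank() ranks by value, this version does not depend on the rows being sorted by AppYear beforehand.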

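Original answer:
library(dplyr)

df1 %>% 
  arrange(AppYear, Patent) %>%
  mutate(app = group_indices(., AppYear),  # id of each distinct year
         # previous distinct year, repeated once per row of the current year
         lag_year = rep(lag(unique(.$AppYear)), count_(., "AppYear")$n))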

#> # A tibble: 10 x 5
#>    Patent  subclass AppYear   app lag_year
#>    <chr>   <chr>      <int> <int>    <int>
#>  1 4683202 435/91.2    1985     1       NA
#>  2 4683195 435/91.2    1986     2     1985
#>  3 4800159 435/91.2    1986     2     1985
#>  4 4965188 435/91.2    1987     3     1986
#>  5 4994368 435/91.2    1987     3     1986
#>  6 5328824 435/91.2    1987     3     1986
#>  7 4879214 435/91.2    1988     4     1987
#>  8 4921794 435/91.2    1988     4     1987
#>  9 4983728 435/91.2    1990     5     1988
#> 10 4994372 435/91.2    1990     5     1988
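
To make the rep(lag(unique(...)), ...) idea easier to follow, here is a rough base R breakdown of the same steps on the AppYear column alone. It is a sketch under the assumption that the years are already sorted; rle() and match() stand in for count_() and group_indices(), and all variable names are illustrative.

```r
# AppYear values of the 10-row sample, sorted by year
years <- c(1985L, 1986L, 1986L, 1987L, 1987L, 1987L, 1988L, 1988L, 1990L, 1990L)

distinct_years <- unique(years)                 # 1985 1986 1987 1988 1990
prev_year <- c(NA, head(distinct_years, -1))    # shift by one, like lag()
rows_per_year <- rle(years)$lengths             # 1 2 3 2 2 (needs sorted input)

# repeat each previous distinct year once per row of the current year
lag_year <- rep(prev_year, rows_per_year)
lag_year
#> [1]   NA 1985 1985 1986 1986 1986 1987 1987 1988 1988

# position of each year among the distinct years gives the rolling count
app <- match(years, distinct_years)
app
#> [1] 1 2 2 3 3 3 4 4 5 5
```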
Data:
df1 <- structure(list(Patent=c("4683202", "4683195", "4800159", "4965188", 
                      "4994368", "5328824", "4879214", "4921794", "4983728", "4994372"), 
                 subclass=c("435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2",
                      "435/91.2", "435/91.2", "435/91.2", "435/91.2", "435/91.2"), 
                 AppYear=c(1985L, 1986L, 1986L, 1987L, 1987L, 1987L, 1988L, 
                      1988L, 1990L, 1990L)), 
                 row.names=c(NA, -10L), 
                 class=c("tbl_df", "tbl", "data.frame"))

Source: the original question on rolling counts of distinct years is on Stack Overflow: https://stackoverflow.com/questions/56840843/
