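r - Aggregating counts by month from start and end range variables using dplyr?

Suppose I have school enrollment data stored in this format, with a start date field and an end date field: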

unique_name   enrollment_start   enrollment_end
Amy           1, Jan, 2017       30, Sep, 2018
Franklin      1, Jan, 2017       19, Feb, 2017
Franklin      5, Jun, 2017       4, Feb, 2018
Franklin      21, Oct, 2018      9, Mar, 2019
Samir         1, Jan, 2017       4, Feb, 2017
Samir         5, Apr, 2017       12, Sep, 2018
...           ...                ...

I would like to produce a summary of enrollment counts by month, like this:

month       enrollment_count
Jan, 2017   25
Feb, 2017   31
Mar, 2017   19
Apr, 2017   34
May, 2017   29
Jun, 2017   32
...         ...

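Is there a simple way to do this with dplyr?

The only approach I can think of is to loop over a list of all months from month_min to month_max and count, for each month, the rows whose start or end date falls within it. I am hoping there is simpler code.

Best answer

I think this can be done quite elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment starts or ends within a month, you should be counted as active in that month.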

library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)

enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy",        "1, Jan, 2017",    "30, Sep, 2018",
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Feb, 2018",
  "Franklin",   "21, Oct, 2018",   "9, Mar, 2019",
  "Samir",      "1, Jan, 2017",    "4, Feb, 2017",
  "Samir",      "5, Apr, 2017",    "12, Sep, 2018"
)

# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )

enrollments
#> # A tibble: 6 × 3
#>   unique_name start        end         
#>   <chr>       <ymd<month>> <ymd<month>>
#> 1 Amy         2017-01      2018-10     
#> 2 Franklin    2017-01      2017-03     
#> 3 Franklin    2017-06      2018-03     
#> 4 Franklin    2018-10      2019-04     
#> 5 Samir       2017-01      2017-03     
#> 6 Samir       2017-04      2018-10

# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")

enrollments
#> # A tibble: 6 × 2
#>   unique_name             active
#>   <chr>         <iv<ymd<month>>>
#> 1 Amy         [2017-01, 2018-10)
#> 2 Franklin    [2017-01, 2017-03)
#> 3 Franklin    [2017-06, 2018-03)
#> 4 Franklin    [2018-10, 2019-04)
#> 5 Samir       [2017-01, 2017-03)
#> 6 Samir       [2017-04, 2018-10)

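# We'll generate a sequence of months that will be part of the final result.
# Take the earliest start and the latest end over all intervals: the interval
# with the latest start does not necessarily have the latest end.
lower <- min(iv_start(enrollments$active))
upper <- max(iv_end(enrollments$active)) - 1L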

months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#>    month       
#>    <ymd<month>>
#>  1 2017-01     
#>  2 2017-02     
#>  3 2017-03     
#>  4 2017-04     
#>  5 2017-05     
#>  6 2017-06     
#>  7 2017-07     
#>  8 2017-08     
#>  9 2017-09     
#> 10 2017-10     
#> # … with 17 more rows

# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
  mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#>    month        count
#>    <ymd<month>> <int>
#>  1 2017-01          3
#>  2 2017-02          3
#>  3 2017-03          1
#>  4 2017-04          2
#>  5 2017-05          2
#>  6 2017-06          3
#>  7 2017-07          3
#>  8 2017-08          3
#>  9 2017-09          3
#> 10 2017-10          3
#> # … with 17 more rows

Created on 2022-04-05 by the reprex package (v2.0.1)
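
Since the question asks specifically about dplyr, here is a minimal sketch of a different technique that uses only dplyr, tidyr, and base R dates instead of clock and ivs: expand every enrollment row into one row per active month, then count rows per month. It is not part of the answer above. It assumes the dates have already been converted to ISO format (`"2017-01-01"`) so that `as.Date()` parses them without depending on the locale, and the helper columns `first_month` and `last_month` are names made up for this illustration. It keeps the same convention as the answer: an enrollment counts as active in every month it touches.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Same sample data as above, rewritten in ISO format (an assumption of this
# sketch) so that parsing does not depend on locale-specific month names
enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy",        "2017-01-01",      "2018-09-30",
  "Franklin",   "2017-01-01",      "2017-02-19",
  "Franklin",   "2017-06-05",      "2018-02-04",
  "Franklin",   "2018-10-21",      "2019-03-09",
  "Samir",      "2017-01-01",      "2017-02-04",
  "Samir",      "2017-04-05",      "2018-09-12"
) %>%
  mutate(across(c(enrollment_start, enrollment_end), as.Date))

monthly_counts <- enrollments %>%
  mutate(
    # Snap both dates to the first day of their month
    first_month = as.Date(format(enrollment_start, "%Y-%m-01")),
    last_month  = as.Date(format(enrollment_end, "%Y-%m-01"))
  ) %>%
  rowwise() %>%
  # One list element per row: every month the enrollment touches
  mutate(month = list(seq(first_month, last_month, by = "month"))) %>%
  ungroup() %>%
  # One row per (enrollment, month) pair, then count rows per month
  unnest(month) %>%
  count(month, name = "enrollment_count")

monthly_counts
```

One design difference to keep in mind: a month in which nobody is enrolled simply does not appear in `monthly_counts`, whereas the ivs approach reports it with a count of 0. If zero rows matter, something like `tidyr::complete()` over a full month sequence could fill them in. On large data, expanding every row into months can also use noticeably more memory than counting against intervals.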

For "r - Aggregating counts by month from start and end range variables using dplyr?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/71621389/
