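Suppose I have school enrollment data stored in this format, with start date and end date fields:
I want to generate a monthly summary of enrollment counts, like this:
Is there a simple way to do this with dplyr?
The only approach I can think of is to loop over a list of all months from month_min to month_max and count the rows whose start or end date falls within each month. I'm hoping there is simpler code.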
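For reference, the loop described above might look roughly like the following base R sketch. The sample rows, the ISO date format, and the column names `start` and `end` are assumptions made for illustration. It counts a row whenever its date range overlaps a month, which is one reading of the description.

```r
# Hypothetical sample data; column names and values are assumed
enrollments <- data.frame(
  unique_name = c("Amy", "Franklin", "Samir"),
  start = as.Date(c("2017-01-01", "2017-01-01", "2017-04-05")),
  end   = as.Date(c("2017-03-15", "2017-02-19", "2017-05-12"))
)

# Truncate a Date to the first day of its month
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# Every month from the earliest start to the latest end
month_min <- first_of_month(min(enrollments$start))
month_max <- first_of_month(max(enrollments$end))
months <- seq(month_min, month_max, by = "month")

# Loop over the months; a row is active if its range overlaps the month
count <- integer(length(months))
for (i in seq_along(months)) {
  month_start <- months[i]
  next_month <- seq(month_start, by = "month", length.out = 2)[2]
  count[i] <- sum(enrollments$start < next_month & enrollments$end >= month_start)
}

monthly <- data.frame(month = format(months, "%Y-%m"), count = count)
monthly
```

This works, but it rescans every row once per month, which is the overhead the question hopes to avoid.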
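Best answer
I think this can be done quite elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment starts or ends within a month, you should be counted as active in that month.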
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy",        "1, Jan, 2017",    "30, Sep, 2018",
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Feb, 2018",
  "Franklin",   "21, Oct, 2018",   "9, Mar, 2019",
  "Samir",      "1, Jan, 2017",    "4, Feb, 2017",
  "Samir",      "5, Apr, 2017",    "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )
enrollments
#> # A tibble: 6 × 3
#>   unique_name start        end
#>   <chr>       <ymd<month>> <ymd<month>>
#> 1 Amy         2017-01      2018-10
#> 2 Franklin    2017-01      2017-03
#> 3 Franklin    2017-06      2018-03
#> 4 Franklin    2018-10      2019-04
#> 5 Samir       2017-01      2017-03
#> 6 Samir       2017-04      2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#>   unique_name             active
#>   <chr>         <iv<ymd<month>>>
#> 1 Amy         [2017-01, 2018-10)
#> 2 Franklin    [2017-01, 2017-03)
#> 3 Franklin    [2017-06, 2018-03)
#> 4 Franklin    [2018-10, 2019-04)
#> 5 Samir       [2017-01, 2017-03)
#> 6 Samir       [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#>    month
#>    <ymd<month>>
#>  1 2017-01
#>  2 2017-02
#>  3 2017-03
#>  4 2017-04
#>  5 2017-05
#>  6 2017-06
#>  7 2017-07
#>  8 2017-08
#>  9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
  mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#>    month        count
#>    <ymd<month>> <int>
#>  1 2017-01          3
#>  2 2017-02          3
#>  3 2017-03          1
#>  4 2017-04          2
#>  5 2017-05          2
#>  6 2017-06          3
#>  7 2017-07          3
#>  8 2017-08          3
#>  9 2017-09          3
#> 10 2017-10          3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
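If adding clock and ivs is not an option, a dplyr-only variant is sketched below. This is an alternative technique, not part of the answer above: it expands each enrollment into one row per month it touches and then counts rows per month. The sample rows, ISO date format, and column names are assumptions, and `reframe()` requires dplyr 1.1.0 or later.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical sample data; column names and values are assumed
enrollments <- tibble(
  unique_name = c("Amy", "Franklin", "Samir"),
  start = as.Date(c("2017-01-01", "2017-01-01", "2017-04-05")),
  end   = as.Date(c("2017-03-15", "2017-02-19", "2017-05-12"))
)

# Truncate a Date to the first day of its month
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

monthly <- enrollments %>%
  mutate(start = first_of_month(start), end = first_of_month(end)) %>%
  rowwise() %>%
  # One output row per month that each enrollment touches
  reframe(month = seq(start, end, by = "month")) %>%
  # Number of enrollments active in each month
  count(month, name = "count")

monthly
```

Unlike the sequence-of-months approach above, a month with no active enrollments simply does not appear in this result, so it would need to be joined against a full month sequence if zero counts matter. The expansion also creates one row per enrollment per month, which can grow large for long ranges.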
Regarding "r - Aggregating counts by month from start/end range variables using dplyr?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71621389/