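I have a data frame with a bunch of start and end dates. I am looping over a list of dates and counting how many rows in the data frame were "open" on each date in the list (i.e., the start date has occurred but the end date has not yet).
I am currently doing this with lapply, but I am wondering whether it can be done in dplyr, and whether that would bring any benefit in memory or speed (the actual data frame has 1.5 million rows).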
RollingDateRange <- seq(Sys.Date()-15, Sys.Date(), by="days")
temp <- data.frame(RollingDateRange)
dat <- data.frame(
  Order = c(1,1,1,2,2,2,3,3,3),
  Code = c("Green","Yellow","Blue","Yellow","Yellow","Red","Purple","Green","Blue"),
  Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
  End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
  Count = c(1,1,1,1,1,1,1,1,1),
  stringsAsFactors = FALSE)
temp$Count <- lapply(temp$RollingDateRange, function(d){
  b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]
  total <- sum(b$Count, na.rm = TRUE)
})
Output:
> temp
RollingDateRange Count
1 2020-01-25 0
2 2020-01-26 0
3 2020-01-27 0
4 2020-01-28 0
5 2020-01-29 0
6 2020-01-30 0
7 2020-01-31 0
8 2020-02-01 3
9 2020-02-02 6
10 2020-02-03 8
11 2020-02-04 7
12 2020-02-05 7
13 2020-02-06 7
14 2020-02-07 5
15 2020-02-08 4
16 2020-02-09 2
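Best answer
Consider vapply with vector indexing, which may reduce the work done in the lapply version. Specifically, unlike lapply, which returns a list, and sapply, which returns a vector by default, vapply (similar to sapply) returns a vector of a declared type and length: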
temp$Count <- vapply(temp$RollingDateRange, function(d){
  # LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
  b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) |
                           ((Start.Date <= d) & (is.na(End.Date)))])
  total <- sum(b, na.rm = TRUE)
}, numeric(1))
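To make the return-type difference concrete, here is a small sketch on a toy vector. The names `x`, `as_list`, `as_vec`, `checked`, and the rest are illustrative only and do not come from the question.

```r
x <- c(1, 2, 3)

# lapply always returns a list, one element per input
as_list <- lapply(x, function(i) i * 2)

# sapply simplifies to a vector when it can
as_vec <- sapply(x, function(i) i * 2)

# vapply returns a vector and checks each result against the template numeric(1)
checked <- vapply(x, function(i) i * 2, numeric(1))

is.list(as_list)            # the lapply result is a list
is.numeric(as_vec)          # the sapply result is a plain numeric vector
identical(as_vec, checked)  # same values; vapply adds the type check

# On zero-length input sapply falls back to an empty list,
# while vapply still returns the declared type
empty_sapply <- sapply(numeric(0), function(i) i * 2)
empty_vapply <- vapply(numeric(0), function(i) i * 2, numeric(1))

# A result that does not match the template is an error rather than a silent coercion
mismatch <- tryCatch(
  vapply(x, function(i) as.character(i), numeric(1)),
  error = function(e) "error"
)
```

This is also why assigning the lapply result to temp$Count in the question produces a list column, whereas the vapply version produces an ordinary numeric column.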
On your small example the timings already differ, although at this scale the differences are only a few milliseconds:
system.time( {
  temp$Count <- lapply(temp$RollingDateRange, function(d){
    # LOGICAL INDEXING OF DATA FRAME RETURNING ALL COLUMNS
    b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) |
             ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]
    total <- sum(b$Count, na.rm = TRUE)
  })
})
# user system elapsed
# 0.003 0.000 0.005
system.time( {
  temp$Count <- vapply(temp$RollingDateRange, function(d){
    # LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
    b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) |
                             ((Start.Date <= d) & (is.na(End.Date)))])
    total <- sum(b, na.rm = TRUE)
  }, numeric(1))
})
# user system elapsed
# 0.001 0.000 0.001
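Timings on nine rows sit close to the resolution of system.time, so a larger input gives a more meaningful comparison. The following is a rough sketch on synthetic data: the data frame `big`, the date range `days`, and the helpers `df_index` and `vec_index` are made up for illustration, and the row count is arbitrary. No timings are quoted here because they depend on the machine.

```r
set.seed(42)
n <- 10000

# Synthetic intervals: random start days, durations of 0-30 days, about 10% still open (NA end)
start <- as.Date("2020-01-01") + sample(0:60, n, replace = TRUE)
end <- start + sample(0:30, n, replace = TRUE)
end[sample(n, n %/% 10)] <- NA
big <- data.frame(Start.Date = start, End.Date = end, Count = 1)

days <- seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "days")

# Data frame indexing: subsets all columns, then sums one of them
df_index <- function(d) {
  b <- big[((big$Start.Date <= d) & (big$End.Date >= d)) |
           ((big$Start.Date <= d) & is.na(big$End.Date)), ]
  sum(b$Count, na.rm = TRUE)
}

# Vector indexing: subsets only the Count column
vec_index <- function(d) {
  with(big, sum(Count[(Start.Date <= d & End.Date >= d) |
                      (Start.Date <= d & is.na(End.Date))], na.rm = TRUE))
}

time_df <- system.time(res_df <- unlist(lapply(days, df_index)))
time_vec <- system.time(res_vec <- vapply(days, vec_index, numeric(1)))

# Both approaches should agree before their timings are compared
identical(res_df, res_vec)
print(time_df)
print(time_vec)
```

Raising `n` toward the 1.5 million rows mentioned in the question should show how the gap scales on a given machine.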
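For comparison, here are the other proposed solutions; timings may vary with machine and package versions.
# Packages needed for the two solutions below
library(dplyr)
library(purrr)
# @akrun's SOLUTION
system.time( {
  temp %>%
    pull(RollingDateRange) %>%
    map_dfr(~
      dat %>%
        filter((Start.Date <= .x & End.Date >= .x)|
               (Start.Date <= .x & is.na(End.Date))) %>%
        pull(Count) %>%
        sum %>%
        tibble(RollingDateRange = .x, Count = .))
})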
# user system elapsed
# 0.029 0.000 0.029
# @RonakShah's SOLUTION
system.time({
  temp %>%
    mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat,
      sum(Count[(Start.Date <= .x & End.Date >= .x) |
                (Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))
})
# user system elapsed
# 0.002 0.000 0.001
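All of the solutions above still scan every row once per date. For 1.5 million rows, one further option (not part of the answers above, and only a sketch) is to count rows started on or before each day and subtract rows that ended strictly before it, using sorted cumulative sums and base R's findInterval. The function name `count_open` is hypothetical, and the approach assumes that Start.Date is never NA and that End.Date, when present, is never earlier than Start.Date.

```r
dat <- data.frame(
  Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02",
                         "2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
  End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",
                       NA,"2020-02-03","2020-02-08","2020-02-06")),
  Count = rep(1, 9)
)
# Fixed range so the result is reproducible (the question uses Sys.Date())
days <- seq(as.Date("2020-01-25"), as.Date("2020-02-09"), by = "days")

# Hypothetical helper: assumes no NA starts and start <= end wherever end is not NA
count_open <- function(days, start, end, count) {
  days_num <- as.numeric(days)

  # Running total of Count in start order; findInterval gives the number of starts <= day
  by_start <- order(start)
  cum_started <- c(0, cumsum(count[by_start]))
  started <- cum_started[findInterval(days_num, as.numeric(start[by_start])) + 1]

  # Same for rows that have an end date; left.open = TRUE counts ends strictly before the day
  closed <- !is.na(end)
  by_end <- order(end[closed])
  cum_ended <- c(0, cumsum(count[closed][by_end]))
  ended <- cum_ended[findInterval(days_num, as.numeric(end[closed][by_end]),
                                  left.open = TRUE) + 1]

  started - ended
}

fast <- count_open(days, dat$Start.Date, dat$End.Date, dat$Count)

# Reference result from the per-date loop used in the answer
slow <- vapply(days, function(d) {
  with(dat, sum(Count[(Start.Date <= d & End.Date >= d) |
                      (Start.Date <= d & is.na(End.Date))], na.rm = TRUE))
}, numeric(1))

identical(fast, slow)
```

On the sample data this reproduces the counts shown in the question. Whether it is worth the extra assumptions depends on how many dates are being checked; with only a handful of dates the per-date loop may be fast enough.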
Regarding "r - using dplyr instead of lapply", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60141003/