Suppose I have the following df:
df <- data.frame(group = c("a", "a", "b", "b"),
                 start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
                 end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"))
which prints as:
group start end
1 a 2017-05-01 2018-09-01
2 a 2019-04-03 2020-04-03
3 b 2011-03-03 2012-05-03
4 b 2014-05-07 2016-04-02
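I want to transform it into the following format, where each record is split at 31 December of its start year and of every following year it spans: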
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
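Any ideas on how to approach this?
Edit:
My main concern was not date ranges that fall within a single year. However, as chinsoon12 pointed out, it would indeed help if the approach could handle those as well, for example in this dataset: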
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
The final result would keep the last row unchanged:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
10 c 2017-02-01 2017-04-05
Best answer
One possible solution with data.table:
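library(data.table)
setDT(df)  # convert by reference; assumes start and end are Date columns (see the data at the end)
# repeat each row once per calendar year its range touches, then clamp each copy to its year
df[df[, rep(.I, 1 + year(end) - year(start))]
   ][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
             end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
     , by = .(group, rleid(start))][]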
This gives:
    group      start        end
 1:     a 2017-05-01 2017-12-31
 2:     a 2018-01-01 2018-09-01
 3:     a 2019-04-03 2019-12-31
 4:     a 2020-01-01 2020-04-03
 5:     b 2011-03-03 2011-12-31
 6:     b 2012-01-01 2012-05-03
 7:     b 2014-05-07 2014-12-31
 8:     b 2015-01-01 2015-12-31
 9:     b 2016-01-01 2016-04-02
10:     c 2017-02-01 2017-04-05
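To see why the result has ten rows, it can help to look at the row-replication step on its own. The following is a minimal sketch, not part of the original answer: the names dt, n_years, idx and expanded are made up for illustration, and the data is the five-row example with dates already stored as Date.

```r
library(data.table)

# Five-row example data, with start/end stored as Date (illustrative copy)
dt <- data.table(
  group = c("a", "a", "b", "b", "c"),
  start = as.Date(c("2017-05-01", "2019-04-03", "2011-03-03",
                    "2014-05-07", "2017-02-01")),
  end   = as.Date(c("2018-09-01", "2020-04-03", "2012-05-03",
                    "2016-04-02", "2017-04-05"))
)

# Number of calendar years each range touches
n_years <- dt[, 1 + year(end) - year(start)]

# Row numbers (.I), each repeated once per calendar year touched
idx <- dt[, rep(.I, 1 + year(end) - year(start))]

# Subsetting with the repeated indices duplicates the rows accordingly
expanded <- dt[idx]
```

Each duplicated row still carries the original start and end at this point; the `:=` step in the answer then clamps every copy to its own calendar year with pmax and pmin.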
Two alternative solutions with data.table:
# alternative 1:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = if (.N == 1) start else c(start[1], as.Date(paste0(year(start[1]) + 1:(.N-1), '-01-01') )),
end = if (.N == 1) end else c(as.Date(paste0(year(end[.N]) - (.N-1):1, '-12-31') ), end[.N]))
, by = .(group, ri)][, ri := NULL][]
# alternative 2:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
, by = .(group, ri)][, ri := NULL][]
Data used:
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
df[2:3] <- lapply(df[2:3], as.Date)
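For comparison, the same clamping idea can be sketched in base R without data.table. This is an illustrative alternative rather than the answer's method: split_range is a hypothetical helper, ex is an illustrative copy of the data, and the sketch assumes start <= end in every row.

```r
# Illustrative copy of the data, with dates stored as Date
ex <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  start = as.Date(c("2017-05-01", "2019-04-03", "2011-03-03",
                    "2014-05-07", "2017-02-01")),
  end   = as.Date(c("2018-09-01", "2020-04-03", "2012-05-03",
                    "2016-04-02", "2017-04-05")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one start/end pair at calendar-year boundaries
split_range <- function(start, end) {
  yrs   <- as.integer(format(start, "%Y")):as.integer(format(end, "%Y"))
  jan1  <- as.Date(paste0(yrs, "-01-01"))
  dec31 <- as.Date(paste0(yrs, "-12-31"))
  # Clamp each year's block to the original range
  data.frame(start = pmax(start, jan1), end = pmin(end, dec31))
}

# Split every row, then stack the pieces back into one data frame
pieces <- lapply(seq_len(nrow(ex)), function(i) {
  data.frame(group = ex$group[i], split_range(ex$start[i], ex$end[i]),
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
```

Looping over row indices instead of over the date columns directly keeps the Date class intact on each element. For large inputs the data.table versions above are likely to be faster, since this sketch builds one small data frame per input row.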
Regarding "r - Split date ranges into blocks ending on YYYY-12-31", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50729220/