r - Create a table of all possible time intervals from existing intervals

Tags: r

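I have two (very large) tables that each have "start" and "end" dates. I want to merge the two tables so that I have every set of "start" and "end" dates that can be formed from the original dates. For example, if int1 == 0:6 and int2 == 3:9, then I want three intervals: 0:2, 3:6, 7:9.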
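The 0:6 / 3:9 example can be reproduced in a few lines of base R. This is only a sketch of the idea, assuming closed integer intervals; the names starts, ends, cuts and segments are illustrative and not from the original post.

```r
# Toy intervals from the example: int1 = 0:6, int2 = 3:9 (both ends inclusive)
starts <- c(0, 3)
ends   <- c(6, 9)

# A new segment begins at every start and on the step after every end
cuts <- sort(unique(c(starts, ends + 1)))   # 0 3 7 10

# Consecutive cut points delimit the non-overlapping segments
segments <- data.frame(from = head(cuts, -1),
                       to   = tail(cuts, -1) - 1)
segments
#   from to
# 1    0  2
# 2    3  6
# 3    7  9
```

The number of segments is bounded by the number of distinct endpoints, not by the length of the time span.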

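I have tried foverlaps, and I have tried manually creating every possible date interval and then merging the data onto that table. The code below shows these failed attempts on toy data. The expected output below should make clear what I am trying to accomplish.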

The existing tables are massive (millions of IDs, each with multiple sets of dates).

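I am currently trying a third approach: creating an empty table with one row per id per day (the same day serving as both from_date and to_date). The problem with this approach is that, given the number of IDs and years I need to cover, it is extremely slow. It has been running for almost 20 hours and my base table is still being created. After that, the plan is to merge the existing tables onto it using foverlaps.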
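To make the cost of this third approach concrete, here is a minimal sketch of the per-day base table for two hypothetical ids over a single year. The ids and the name base are illustrative only; the point is that the row count is (number of ids) x (number of days), no matter how few intervals each id actually has.

```r
library(data.table)

ids  <- c(1111, 2222)                      # hypothetical ids
days <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")

# One row per id per day; each day is both the start and the end of its row
base <- data.table(id        = rep(ids, each = length(days)),
                   from_date = rep(days, times = length(ids)))
base[, to_date := from_date]

nrow(base)   # 2 ids * 366 days (2016 is a leap year) = 732
```

At that rate, a million ids over ten years would need roughly 3.65 billion rows before any join is attempted.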

I am pulling my hair out over this problem and would appreciate any help you can offer.

# load packages
library(data.table)
library(lubridate)
# create data
dt1<- data.table(id = rep(1111, 4),
           from_date = as.Date(c("2016-01-01", "2016-03-31","2016-09-02", "2016-09-03")), 
           to_date = as.Date(c("2016-03-15", "2016-09-01", "2016-09-02", "2016-12-15")), 
           progs = c("a1", "b1", "c1", "d1"))
setkey(dt1, id, from_date, to_date)    

dt2<- data.table(id = rep(1111, 4),
           from_date = as.Date(c("2016-02-01", "2016-04-01","2016-11-01", "2016-12-01")), 
           to_date = as.Date(c("2016-02-28", "2016-09-30", "2016-11-30", "2016-12-31")), 
           progs = c("a2", "b2", "c2", "d2"))
setkey(dt2, id, from_date, to_date)    

# expected (hoped for) output
id      from_date   to_date     progs1  progs2
1111    1/1/2016    1/31/2016   a1      NA
1111    2/1/2016    2/28/2016   a1      a2
1111    2/29/2016   3/15/2016   a1      NA
1111    3/31/2016   3/31/2016   b1      NA
1111    4/1/2016    9/1/2016    b1      b2
1111    9/2/2016    9/2/2016    c1      b2
1111    9/3/2016    9/30/2016   d1      b2
1111    10/1/2016   10/31/2016  d1      NA
1111    11/1/2016   11/30/2016  d1      c2
1111    12/1/2016   12/15/2016  d1      d2
1111    12/16/2016  12/31/2016  NA      d2

# failed attempt #1: using foverlaps
overlaps <- foverlaps(x=dt1, y=dt2, 
                by.x = c("id", "from_date", "to_date"),
                by.y = c("id", "from_date", "to_date"), 
                type = "any", 
                mult ="all")
# this does not give every time interval    

# failed attempt #2... super convoluted method
# try to make every possible time interval ----
dt <- rbind(dt1[, .(id, from_date)], dt2[, .(id, from_date)]) 
dt.temp <- rbind(dt1[, .(id, to_date)], dt2[, .(id, to_date)]) # get table with to_dates
setnames(dt.temp, "to_date", "from_date") 
dt <- rbind(dt, dt.temp)
rm(dt.temp)
dt <- unique(dt)
setorder(dt, -from_date)
dt[, to_date := as.Date(c(NA, from_date[-.N]), origin = "1970-01-01"), by = "id"]
setorder(dt, from_date)
dt <- dt[!is.na(to_date)] # the last 'from_date' is actually the final to_date, so it doesn't begin a time interval
dt[, counter := 1:.N, by = id] # create indicator so we can know which interval is the first interval for each id
dt[counter != 1, from_date := as.integer(from_date + 1)] # to prevent overlap with previous interval
dt[, counter := NULL]
setkey(dt, id, from_date, to_date)    

# merge on dt1 ----
dt <- foverlaps(dt, dt1, type = "any", mult = "all")
dt[, from_date := i.from_date] # when dt1 didn't match, the from_date is NA. fill with i.from_date
dt[, to_date := i.to_date] # when dt1 didn't match, the to_date is NA. fill with i.to_date
dt[, c("i.from_date", "i.to_date") := NULL] # no longer needed
setkey(dt, id, from_date, to_date)    

# merge on dt2 ----
dt <- foverlaps(dt, dt2, type = "any", mult = "all")
dt[, from_date := i.from_date] # when dt2 didn't match, the from_date is NA. fill with i.from_date
dt[, to_date := i.to_date] # when dt2 didn't match, the to_date is NA. fill with i.to_date
dt[, c("i.from_date", "i.to_date") := NULL] # no longer needed
setkey(dt, id, from_date, to_date)    

setnames(dt, c("i.progs", "progs"), c("progs1", "progs2"))    

# Collapse data if dates are contiguous and data are the same ----
# Create unique ID for data chunks ----
dt[, group := .GRP, by = c("id", "progs1", "progs2")] # create group id
dt[, group := cumsum( c(0, diff(group)!=0) )] # in situation like a:a:a:b:b:b:b:a:a:a, want to distinguish first set of "a" from second set of "a"    

# Create unique ID for contiguous times within a given data chunk ----
setkey(dt, id, from_date)
dt[, prev_to_date := c(NA, to_date[-.N]), by = "group"]
dt[, diff.prev := from_date - prev_to_date] # difference between from_date & prev_to_date will be 1 (day) if they are contiguous
dt[diff.prev != 1, diff.prev := NA] # set to NA if difference is not 1 day, i.e., it is not contiguous, i.e., it starts a new contiguous chunk
dt[is.na(diff.prev), contig.id := .I] # Give a unique number for each start of a new contiguous chunk (i.e., section starts with NA)
setkey(dt, group, from_date) # need to order the data so the following line will work.
dt[, contig.id  := contig.id[1], by=  .( group , cumsum(!is.na(contig.id))) ] # fill forward by group
dt[, c("prev_to_date", "diff.prev") := NULL] # drop columns that were just intermediates    

# Collapse rows where data chunks are constant and time is contiguous ----      
dt[, from_date := min(from_date), by = c("group", "contig.id")]
dt[, to_date := max(to_date), by = c("group", "contig.id")]
dt[, c("group", "contig.id") := NULL]
dt <- unique(dt)      

# the end result is an incorrect table
id  from_date   to_date progs2  progs1
1111    1/1/2016    2/28/2016   a2  a1
1111    2/29/2016   3/15/2016   NA  a1
1111    3/16/2016   3/31/2016   NA  b1
1111    4/1/2016    9/1/2016    b2  b1
1111    9/2/2016    9/2/2016    b2  c1
1111    9/3/2016    9/30/2016   b2  d1
1111    10/1/2016   11/30/2016  c2  d1
1111    12/1/2016   12/15/2016  d2  d1
1111    12/16/2016  12/31/2016  d2  NA

Compare the expected and actual results above... I was not able to display them neatly in tables here.
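For reference, one way to build the intervals without expanding to daily rows is to cut each id's timeline at every from_date and at the day after every to_date, then look up the program in each table with a non-equi join. The following is a hedged sketch on the toy data, not a solution tested on the full tables. The names both, cuts, brk, segs and res are illustrative, and it assumes that to_date is inclusive and that intervals within a single table never overlap for the same id.

```r
library(data.table)

dt1 <- data.table(id = rep(1111, 4),
  from_date = as.Date(c("2016-01-01", "2016-03-31", "2016-09-02", "2016-09-03")),
  to_date   = as.Date(c("2016-03-15", "2016-09-01", "2016-09-02", "2016-12-15")),
  progs = c("a1", "b1", "c1", "d1"))
dt2 <- data.table(id = rep(1111, 4),
  from_date = as.Date(c("2016-02-01", "2016-04-01", "2016-11-01", "2016-12-01")),
  to_date   = as.Date(c("2016-02-28", "2016-09-30", "2016-11-30", "2016-12-31")),
  progs = c("a2", "b2", "c2", "d2"))

# 1. Break points per id: every from_date, plus the day after every to_date
both <- rbind(dt1, dt2)
cuts <- unique(rbind(both[, .(id, brk = from_date)],
                     both[, .(id, brk = to_date + 1L)]))
setorder(cuts, id, brk)

# 2. Consecutive break points delimit the elementary intervals
segs <- cuts[, .(from_date = brk[-.N], to_date = brk[-1L] - 1L), by = id]

# 3. Each elementary interval lies entirely inside or entirely outside any
#    source interval, so a "contained in" non-equi update join finds its program
segs[dt1, progs1 := i.progs,
     on = .(id, from_date >= from_date, to_date <= to_date)]
segs[dt2, progs2 := i.progs,
     on = .(id, from_date >= from_date, to_date <= to_date)]

# 4. Drop gaps that neither table covers (here 2016-03-16 to 2016-03-30)
res <- segs[!(is.na(progs1) & is.na(progs2))]
res
```

On the toy data this leaves 11 rows, from 2016-01-01 to 2016-12-31. The work grows with the number of interval endpoints rather than the number of days. Collapsing contiguous rows that carry identical program values, as in the second attempt above, would be a separate step if it is needed.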

Best answer

I am not 100% sure what you are trying to do, but tidyr has a function called crossing that gives you all combinations of the values across multiple vectors.


> library(tidyr)
> a <- c("2016-01-01", "2016-03-31","2016-09-02", "2016-09-03")
> b <- c("2016-03-15", "2016-09-01", "2016-09-02", "2016-12-15")
> c <- rep(1111, 4)
> crossing(a, b,c)

# A tibble: 16 x 3
   a          b              c
   <chr>      <chr>      <dbl>
 1 2016-01-01 2016-03-15  1111
 2 2016-01-01 2016-09-01  1111
 3 2016-01-01 2016-09-02  1111
 4 2016-01-01 2016-12-15  1111
 5 2016-03-31 2016-03-15  1111
 6 2016-03-31 2016-09-01  1111
 7 2016-03-31 2016-09-02  1111
 8 2016-03-31 2016-12-15  1111
 9 2016-09-02 2016-03-15  1111
10 2016-09-02 2016-09-01  1111
11 2016-09-02 2016-09-02  1111
12 2016-09-02 2016-12-15  1111
13 2016-09-03 2016-03-15  1111
14 2016-09-03 2016-09-01  1111
15 2016-09-03 2016-09-02  1111
16 2016-09-03 2016-12-15  1111

Is this along the lines of what you are trying to achieve?

Regarding r - Create a table of all possible time intervals from existing intervals, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58357536/
