r - 基于日期范围的数据表合并

标签 r data.table

我有两张 table ,policiesclaims

policies<-data.table(policyNumber=c(123,123,124,125), 
                EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")), 
                EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
   policyNumber       EFDT       EXDT
1:          123 2012-01-01 2013-01-01
2:          123 2013-01-01 2014-01-01
3:          124 2013-01-01 2014-01-01
4:          125 2013-02-01 2014-02-01


claims<-data.table(claimNumber=c(1,2,3,4), 
                   policyNumber=c(123,123,123,124),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
                   claimAmount=c(10,20,20,15))
> claims
   claimNumber policyNumber   lossDate claimAmount
1:           1          123 2012-02-01          10
2:           2          123 2012-08-15          20
3:           3          123 2013-01-01          20
4:           4          124 2013-10-31          15

保单表确实包含保单条款,因为每一行都由保单编号和生效日期唯一标识。

我想以将 claim 与政策条款相关联的方式合并这两个表。如果 claim 具有相同的保单编号,并且 claim 的 lossDate 在保单期限的生效日期和到期日期内(生效日期为包含范围,到期日期为不包含范围),则 claim 与保单条款相关联。我以这种方式合并表格?

这应该类似于左外连接。结果应该看起来像
   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount
1:          123 2012-01-01 2013-01-01           1 2012-02-01          10
2:          123 2012-01-01 2013-01-01           2 2012-08-15          20
3:          123 2013-01-01 2014-01-01           3 2013-01-01          20
4:          124 2013-01-01 2014-01-01           4 2013-10-31          15
5:          125 2013-02-01 2014-02-01          NA       <NA>          NA

最佳答案

版本 1(针对 data.table v1.9.4+ 更新)

试试这个:

# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126), 
                     EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")), 
                     EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))

# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6), 
                   policyNumber=c(123,123,123,124,126,126),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
                   claimAmount=c(10,20,20,15,5,25))

# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)

# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]

# The claim should have inPolicy==T where lossDate is between EFDT and EXDT:
ans[lossDate>=EFDT & lossDate<=EXDT, inPolicy:=T]

# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)

# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))

ans
#   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
#1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
#2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
#3:          123 2013-01-01 2014-01-01           3 2013-01-01          20     TRUE
#4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
#5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
#6:          126 2013-02-01 2014-02-01           6 2014-03-01          25    FALSE
#7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

版本 2

@Arun 建议使用新的 foverlaps函数来自 data.table .我在下面的尝试似乎更难,而不是更容易,所以请让我知道如何改进它。
## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
claims[, lossDate2:=lossDate]  ## Add a redundant lossDate column to use as the end range for claims
setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)

## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]

## Update rows where the claim was out of policy:
ans2[is.na(EFDT), inPolicy:=F]

## Remove duplicates (such as policyNumber==123 & claimNumber==3),
##   and add policies with no claims (policyNumber==125):
setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
ans2 <- rbindlist(list(
  unique(ans2), ## select only the unique values
  policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
), fill=T)

ans2
##    policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
## 1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
## 2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
## 3:          123 2012-01-01 2013-01-01           3 2013-01-01          20     TRUE
## 4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
## 5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
## 6:          126       <NA>       <NA>           6 2014-03-01          25    FALSE
## 7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

版本 3

使用 foverlaps() ,另一个版本:
require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))
foverlaps()需要开始和结束范围/间隔。因此,我们复制 lossDate列到 lossDate2 .

由于EXDT需要开区间,我们从中减去一个,并将其放入新列EXDTclosed .

现在,我们设置 key 。 foverlaps()要求最后两个关键列是间隔。所以它们是最后指定的。我们还希望通过 policyNumber 重叠连接来进行第一次匹配。 .因此,它也在 key 中指定。

我们需要在 claims 上设置 key (检查 ?foverlaps)。我们不必在 policies 上设置 key .但如果你愿意,你可以(然后你可以跳过 by.x 参数,因为它默认采用键值)。由于我们没有为 policies 设置 key 在这里,我们将明确指定 by.x 中的相应列争论。重叠类型默认为any ,我们不必更改(因此未指定)。这导致:
#    policyNumber claimNumber   lossDate claimAmount  lossDate2       EFDT       EXDT EXDTclosed
# 1:          123           1 2012-02-01          10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2:          123           2 2012-08-15          20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3:          123           3 2013-01-01          20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4:          124           4 2013-10-31          15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5:          125          NA       <NA>          NA       <NA> 2013-02-01 2014-02-01 2014-01-31

关于r - 基于日期范围的数据表合并,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21560500/

相关文章:

R:查找数据帧的每一行满足特定条件的最后一个值的位置

r - 模拟RStudio

R 将列表转换为 Data.Frame 或 Table

RTVS:无法使用 data.table 编织文档

r - 仅对非NA元素求和,但如果所有NA都返回NA

r - 如何在 fread 中指定变量文件名

使用 tidyr/data.table 与 data.frames 复制 `expand.grid()` 行为

r - 从同一数据框中另一列的值派生列

r - $ 运算符对原子向量无效

python - 使用 pandas,我如何有效地按组对大型 DataFrame 进行子采样?