I have the following data frame, which records how policies evolve over time:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
                 date_new = c("20200101", "20200115", "20200304", "20200110", "20200215"),
                 date_end = c("20200503", "20200608", "20210101", "20200403", "20200503"),
                 expend = c("", "A_001", "A_002", "", ""))
It looks like this:
Id_policy date_new date_end expend
A_001 20200101 20200503
A_002 20200115 20200608 A_001
A_003 20200304 20210101 A_002
B_001 20200110 20200403
B_002 20200215 20200503
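"Id_policy" identifies a specific policy, "date_new" is the date the policy was issued, and "date_end" is the date the policy ends. Sometimes, however, a policy is extended. In that case a new policy is created, and the variable "expend" gives the name of the previous policy that it replaces.
The idea is to flatten the data set so that we keep only the rows that correspond to distinct policies. The output would look like this: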
Id_policy date_new date_end expend
A_001 20200101 20210101
B_001 20200110 20200403
B_002 20200215 20200503
Has anyone run into a similar problem?
Best answer
One approach is to treat this as a network problem and use igraph functions (see related posts, e.g. Make a group_indices based on several columns; Fast way to group variables based on direct and indirect similarities in multiple columns).
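Set missing 'expend' to 'Id_policy'.
Use graph_from_data_frame to create a graph in which the 'expend' and 'Id_policy' columns are treated as an edge list.
Use components to get the connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join the membership to the original data.
Grab the relevant data, grouped by membership.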
I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr.
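library(data.table)
library(igraph)
setDT(Df)
# a policy with no predecessor points to itself
Df[expend == "", expend := Id_policy]
# build a graph from the (expend, Id_policy) edge list
g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
# component id for each policy (named by vertex)
mem = components(g)$membership
# join the membership onto the original data by policy id
Df[.(names(mem)), on = .(Id_policy), mem := mem]
# one row per component; first/last rely on rows being in date order within each chain
Df[ , .(Id_policy = Id_policy[1],
        date_new = first(date_new),
        date_end = last(date_end)), by = mem]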
# mem Id_policy date_new date_end
# 1: 1 A_001 20200101 20210101
# 2: 2 B_001 20200110 20200403
# 3: 3 B_002 20200215 20200503
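As noted above, the wrangling steps do not have to use data.table. Below is a rough sketch of the same idea with igraph plus base R only. The names `edges` and `flat` are made up for this illustration, and it assumes the dates are fixed-width "YYYYMMDD" strings, so that min() and max() on the character values agree with chronological order.

```r
library(igraph)

Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# Edge list: a policy with no predecessor points to itself
edges <- data.frame(
  from = ifelse(Df$expend == "", Df$Id_policy, Df$expend),
  to   = Df$Id_policy,
  stringsAsFactors = FALSE
)

# Connected components give one id per chain of extended policies
g   <- graph_from_data_frame(edges)
mem <- components(g)$membership

# Look up each policy's component id by vertex name
Df$mem <- unname(mem[Df$Id_policy])

# Order by issue date inside each chain so the first row is the original policy
Df <- Df[order(Df$mem, Df$date_new), ]

# Collapse each chain to a single row
flat <- do.call(rbind, lapply(split(Df, Df$mem), function(d) {
  data.frame(
    Id_policy = d$Id_policy[1],
    date_new  = min(d$date_new),
    date_end  = max(d$date_end),
    stringsAsFactors = FALSE
  )
}))
rownames(flat) <- NULL
flat
```

Unlike first() and last() in the data.table version, this sketch sorts explicitly and takes the min and max per chain, so it does not depend on the rows already being in date order.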
Regarding r - Data management: flatten data with R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66105022/