r - 从每个案例的创建时间开始计算未结案例的更有效方法

标签 r performance data.table dplyr vectorization

我正在尝试找到一种更有效的方法来计算每个案例创建时未结案例的数量。个案在其创建日期/时间戳和审查日期/时间戳之间是“开放的”。您可以复制粘贴下面的代码来查看一个简单的功能示例:

# Create a bunch of date/time stamps for our example
two_thousand                <- as.POSIXct("2000-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_one            <- as.POSIXct("2001-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_two            <- as.POSIXct("2002-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_three          <- as.POSIXct("2003-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_four           <- as.POSIXct("2004-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_five           <- as.POSIXct("2005-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_six            <- as.POSIXct("2006-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_seven          <- as.POSIXct("2007-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eight          <- as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_nine           <- as.POSIXct("2009-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_ten            <- as.POSIXct("2010-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eleven         <- as.POSIXct("2011-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

mid_two_thousand            <- as.POSIXct("2000-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_one        <- as.POSIXct("2001-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_mid_two    <- as.POSIXct("2002-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_three      <- as.POSIXct("2003-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_four       <- as.POSIXct("2004-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_five       <- as.POSIXct("2005-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_six        <- as.POSIXct("2006-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_seven      <- as.POSIXct("2007-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eight      <- as.POSIXct("2008-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_nine       <- as.POSIXct("2009-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_ten        <- as.POSIXct("2010-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eleven     <- as.POSIXct("2011-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

# Create a table that has pairs of created & censored date/time stamps for cases, indicating the range during which each case is "open"
comparison_table    <- data.table(id        = 1:10,
                                  created   = c(two_thousand, two_thousand_two, two_thousand_four, two_thousand_six, two_thousand_eight, two_thousand_ten, two_thousand, two_thousand_six, two_thousand_three, two_thousand_one),
                                  censored  = c(two_thousand_one, two_thousand_three, two_thousand_five, two_thousand_seven, two_thousand_nine, two_thousand_eleven, two_thousand_five, two_thousand_ten, two_thousand_eight, two_thousand_four));

# Create a table that has the creation date/time stamps at which we want to count all the open cases
check_table         <- data.table(id        = 1:12,
                                  creation  = c(mid_two_thousand, mid_two_thousand_one, mid_two_thousand_mid_two, mid_two_thousand_three, mid_two_thousand_four, mid_two_thousand_five, mid_two_thousand_six, mid_two_thousand_seven, mid_two_thousand_eight, mid_two_thousand_nine, mid_two_thousand_ten, mid_two_thousand_eleven)); 

# I use the DPLYR library as the group_by() + summarize() functions make this operation simple
library(dplyr);

# Group by id to set parameter for summarize() function 
check_table_grouped <- group_by(check_table, id);

# For each id in the table, sum the number of times that its creation date/time stamp is greater than the creation date/time and less than the censor date/time of all cases in the comparison table
# EDIT: Also added timing to compare with method below
system.time(check_table_summary <- summarize(check_table_grouped, other_open_values_at_creation_count = sum((comparison_table$created < creation & comparison_table$censored > creation))));

# Result is as desired
check_table_summary;              

# EDIT: Added @David-arenburg's solution with timing
library(data.table);
setDT(check_table)[, creation2 := creation];
setkey(comparison_table, created, censored);
system.time(foverlaps_table <- foverlaps(check_table, comparison_table, by.x = c("creation", "creation2"))[, sum(!is.na(id)), by = i.id]);

# Same results as above
foverlaps_table;

这种方法适用于小型数据集,例如本例中的数据集。然而,即使我使用矢量化操作,计算时间也会呈指数增长,因为操作计数是:(3 * nrow comparisons) * (nrow sum(nrow) calculations)。在 nrow=10,000 时,时间约为 14s,在 nrow=100,000 时,时间 > 20 分钟。我的实际 nrow 是 ~ 1,000,000。

是否有更有效的方法来进行此计算?我目前正在研究多核选项,但即使是那些也只会线性减少执行时间。感谢您的帮助。谢谢!

编辑:添加了@David-arenburg 的data.table::foverlaps 解决方案,它也适用于 nrow < 1000 并且速度更快。但是,它比 summarize 大量行的解决方案。在 10,000 行时,它的长度是原来的两倍。在 50,000 行时,我放弃了 10 倍的等待时间。有趣的是,foverlaps 解决方案似乎不会触发自动垃圾回收,因此一直处于最大 RAM(我的系统为 64GB),而 summarize 解决方案会定期触发自动垃圾回收收集,所以永远不会超过 ~ 40GB 的 RAM。我不确定这是否与速度差异有关。

最终编辑:我以一种使受访者更容易生成具有合适的创建/审查日期时间的大表的方式重写了问题。我还简化并更清楚地解释了问题,明确查找表非常大(违反 data.table::foverlaps 假设)。我什至内置了时序比较功能,使大型案例的测试变得 super 简单!详情在这里:Efficient method for counting open cases at time of each case's submission in large data set

再次感谢大家的帮助! :)

最佳答案

又一个 foverlaps 解决方案。假设 comparison_table 不是太大

library(data.table);
setkey(comparison_table, created, censored);    
times <- sort(unique(c(comparison_table$created, comparison_table$censored)))
dt <- data.table(creation=times+1)[, creation2 := creation];
setkey(dt, creation, creation2)
x <- foverlaps(comparison_table, dt, by.x = c("created", "censored"))[,.N,creation]$N
check_table$newcol <- x[findInterval(check_table$creation, times)]

关于r - 从每个案例的创建时间开始计算未结案例的更有效方法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/34213315/

相关文章:

r - 如何按顺序逐行更新

删除 data.table 中的因子级别

r - 为每行插入一个数字输入 - R Shiny

javascript - javascript中的float数组压缩

R - 将对象存储在列表中

sql-server - 索引 View : How to choose the Clustered Index?

python - 更快地调用函数求值

将 value 替换为 r data.table 中的 Factor

r - 我们如何才能找到对规则的先验支持和信心?

r - 如何使用管道运算符或相关工具将管道分为两个步骤?