R data.table optimization: overdraft financial data

Given the following data.table with financial data (35 million rows):

DT:

userId  Date         balance  overdraft (boolean)
600     2014-11-01   -100     1
600     2014-11-02   1000     0
600     2014-11-03   -100     1
600     2014-11-04   -100     1
600     2014-11-05   100      0
600     2014-11-06   100      0
700     2014-11-01   -100     1
700     2014-11-02   1000     0
700     2014-11-03   -100     1
700     2014-11-04   -100     1
700     2014-11-05   -100     1
700     2014-11-06   100      0

Use case:

a.- Maximum number of consecutive overdraft days per userId.

userId  maxConsecutiveOverdraftDays   
600     2
700     3
800     0
900     1
1000    5

For this case, I did the following:

acum = FALSE

for (i in 1:nrow(DT)) {

  if (DT[i]$overdraft == 1) {

    if (acum == TRUE) {
      DT[i]$acumBalance <- DT[i]$balance + DT[i-1]$balance
      DT[i]$totalConsecutiveOverdraftDays <- DT[i]$overdraft + DT[i-1]$overdraft
    }

    if (DT[i]$userId == DT[i+1]$userId
        && DT[i+1]$overdraft == 1) {
      acum = TRUE
    } else {
      acum = FALSE
    }
  }
}

DT[,maxConsecutiveOverdraftDays:=max(totalConsecutiveOverdraftDays),by=userId]

It takes more than 12 hours to complete.

How can I improve the code and reduce the computation time?

Thanks in advance.

Best answer

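I can't say whether this will solve your performance problem, but rle is helpful here for nice, short code. Since the value of overdraft is always zero or one, we can take the maximum of the product of the run lengths and the run values: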

> aggregate(overdraft~userId, df, FUN=function(x) {
+   r <- rle(x)
+   max(r$lengths * r$values)
+ })
  userId overdraft
1    600         2
2    700         3
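
Since the question works with a data.table rather than a data frame, the same rle idea can probably be expressed with grouping by userId. The following is only a sketch under a few assumptions: the sample rows are copied from the question (users 600 and 700 only), each user has one row per day with no gaps, and the helper name longest_overdraft_run is made up for illustration. The second variant uses data.table's rleid() to label runs instead of calling rle() per group; whether either one is fast enough on 35 million rows would need to be benchmarked.

```r
library(data.table)

# Sample rows copied from the question (users 600 and 700 only).
DT <- data.table(
  userId    = rep(c(600L, 700L), each = 6L),
  Date      = rep(as.Date("2014-11-01") + 0:5, times = 2L),
  balance   = c(-100, 1000, -100, -100, 100, 100,
                -100, 1000, -100, -100, -100, 100),
  overdraft = c(1L, 0L, 1L, 1L, 0L, 0L,
                1L, 0L, 1L, 1L, 1L, 0L)
)

# Hypothetical helper: longest run of 1s in a 0/1 vector.
# rle() returns run lengths and run values; multiplying them zeroes out
# the runs of 0s, so the maximum is the longest overdraft streak.
longest_overdraft_run <- function(x) {
  r <- rle(x)
  max(r$lengths * r$values)
}

# Runs are only meaningful if rows are in date order within each user.
setkey(DT, userId, Date)

# Variant 1: apply the rle-based helper once per user.
res <- DT[, .(maxConsecutiveOverdraftDays = longest_overdraft_run(overdraft)),
          by = userId]

# Variant 2 (an alternative, not from the answer): label each run with
# rleid(), count overdraft days per run, then take the maximum per user.
runs <- DT[, .(days = sum(overdraft)),
           by = .(userId, run = rleid(userId, overdraft))]
res2 <- runs[, .(maxConsecutiveOverdraftDays = max(days)), by = userId]

print(res)
```

A user with no overdraft days at all would get 0 from both variants, because every run then has value 0.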

A similar question about R data.table optimization with overdraft financial data can be found on Stack Overflow: https://stackoverflow.com/questions/27494849/
