r - data.table does not summarize correctly by two columns

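I've run into a data.table problem that has been driving me crazy lately. It looks like a bug, but maybe I'm missing something obvious here.

I have the following data frame: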

# Load the packages used below (ymd() comes from lubridate)
library(data.table)
library(lubridate)

# First some data
data <- data.table(structure(list(
  month = structure(c(1356998400, 1356998400, 1356998400, 
                      1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 
                      1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 
                      1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 
                      1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", 
                                                                                 "POSIXt"), tzone = "UTC"), 
  portal = c(TRUE, TRUE, FALSE, TRUE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
  ), 
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 
                   10L, 10L)), 
                  .Names = c("month", "portal", "satisfaction"), 
                  row.names = c(NA, -25L), class = "data.frame"))

I want to summarize it by portal and month. Summarizing with good old tapply works as expected - I get a 3x2 matrix with the results for December 2012 and January-February 2013:

> tapply(data$satisfaction, list(data$month, data$portal), mean)
           FALSE      TRUE
2012-12-01   8.5  8.000000
2013-01-01  10.0 10.000000
2013-02-01   9.0  9.545455

Summarizing with data.table's by argument does not:

> data[, mean(satisfaction), by = 'month,portal']
   month      portal        V1
1: 2013-01-01  FALSE 10.000000
2: 2013-02-01   TRUE  9.000000
3: 2013-01-01   TRUE 10.000000
4: 2012-12-01  FALSE  8.500000
5: 2012-12-01   TRUE  7.333333
6: 2013-02-01   TRUE  9.666667
7: 2013-02-01  FALSE  9.000000
8: 2012-12-01   TRUE 10.000000

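As you can see, it returns a data table with 8 rows rather than the expected 6; for example, the group with portal == TRUE and month == 2013-02-01 appears twice (rows 2 and 6), as does portal == TRUE and month == 2012-12-01 (rows 5 and 8).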
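
The expected count of 6 groups can be cross-checked without data.table. The following is a minimal sketch in base R that rebuilds the same 25 rows as a plain data.frame; the names `df`, `dec`, `jan`, `feb`, `n_groups` and `means` are introduced here for illustration only.

```r
# Same values as the question's data, held in a plain data.frame
dec <- 1354320000; jan <- 1356998400; feb <- 1359676800  # seconds since epoch, UTC

portal <- rep(TRUE, 25)
portal[c(3, 10, 12, 14)] <- FALSE  # the four FALSE rows in the question's data

df <- data.frame(
  month = as.POSIXct(c(jan, jan, jan, feb, dec, feb, feb, jan, jan,
                       dec, dec, dec, feb, feb, feb, jan, feb, feb,
                       jan, feb, feb, feb, feb, dec, dec),
                     origin = "1970-01-01", tz = "UTC"),
  portal = portal,
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L,
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L,
                   10L, 10L)
)

# Number of distinct (month, portal) combinations
n_groups <- nrow(unique(df[, c("month", "portal")]))
n_groups  # 6

# Group means, computed the same way as the tapply call above
means <- tapply(df$satisfaction, list(df$month, df$portal), mean)
means
```

Any grouped result with more rows than `n_groups` contains split groups, which is what the 8-row output shows.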

Interestingly, if I restrict it to the 2013 data only, everything works as expected:

> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
        month portal        V1
1: 2013-01-01   TRUE 10.000000
2: 2013-01-01  FALSE 10.000000
3: 2013-02-01   TRUE  9.545455
4: 2013-02-01  FALSE  9.000000

I'm baffled beyond belief :). Can anyone help me?

Best answer

This is a known bug that was fixed in data.table 1.8.7 (not yet on CRAN at the time of writing).

From the data.table NEWS:

BUG FIXES

    <...>

o   setkey could sort 'double' columns (such as POSIXct) incorrectly when not the
    last column of the key, #2484. In data.table's C code :
        x[a] > x[b]-tol
    should have been :
        x[a]-x[b] > -tol  [or  x[b]-x[a] < tol ]
    The difference may have been machine/compiler dependent. Many thanks to statquant
    for the short reproducible example. Test added.
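
The two comparisons in the NEWS entry are algebraically equivalent, but they can disagree in double-precision arithmetic when the values are as large as POSIXct timestamps. Here is a minimal sketch in R, assuming a tolerance of sqrt(.Machine$double.eps); the NEWS entry does not state the value data.table's C code uses, so `tol` here is illustrative only.

```r
# Assumed tolerance, for illustration only (about 1.5e-8)
tol <- sqrt(.Machine$double.eps)

# Two identical timestamps (2013-01-01 UTC as seconds since the epoch)
x <- c(1356998400, 1356998400)

# Old form: tol is smaller than the spacing between doubles near 1.36e9,
# so x[2] - tol rounds back to x[2] and the tolerance has no effect
old_form <- x[1] > x[2] - tol

# Fixed form: the difference (exactly 0) is computed first,
# so the comparison against -tol behaves as intended
new_form <- x[1] - x[2] > -tol

old_form  # FALSE
new_form  # TRUE
```

With the old form, two equal values are not recognized as being within tolerance of each other, which is consistent with equal months ending up in separate groups.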

After updating to 1.8.7 with install.packages("data.table", repos="http://R-Forge.R-project.org"), everything works as expected.
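
To check whether an installed copy already includes the fix, its version can be compared against 1.8.7. This is a minimal sketch; `includes_fix` is a hypothetical helper name, not part of data.table.

```r
# Hypothetical helper: TRUE when a version string is at least 1.8.7,
# the release in which the NEWS entry above reports the fix
includes_fix <- function(version) {
  package_version(version) >= package_version("1.8.7")
}

includes_fix("1.8.6")  # FALSE
includes_fix("1.8.7")  # TRUE

# Apply it to the installed package, if data.table is present
if (requireNamespace("data.table", quietly = TRUE)) {
  print(includes_fix(as.character(packageVersion("data.table"))))
}
```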

A similar question about "r - data.table does not summarize correctly by two columns" can be found on Stack Overflow: https://stackoverflow.com/questions/15077232/
