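r - Undocumented error in dcast.data.table

(This was previously posted on the data-table-help mailing list, but it has gone a few weeks without comment, and I have made some more attempts to debug it since.)

I'm running into a strange error that an internet search turns up only in data.table's commit logs: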

# Error in dcast.data.table(test.table, as.formula(paste(class.col, "+",  : 
#   retFirst must be integer vector the same length as nrow(i)

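This occurs when I run a previously tested, working dcast.data.table expression on a data.table that I produced by randomly resampling Trial with replacement. The offending part is this: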

dcast.data.table(test.table, 
                 Class + Time + Trial ~ Channel,
                 value.var = "Voltage",
                 fun.aggregate=identity)

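It seems to be choking on near-duplicate rows in the input table (i.e., the error is the same whether or not the id column is present in the table):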

test.table <- structure(list(Trial = c(1169L, 1169L), Sample = c(155L, 155L
), Class = c(1L, 1L), Subject = structure(c(13L, 13L), .Label = c("s01", 
"s02", "s03", "s04", "s05", "s06", "s07", "s08", "s09", "s10", 
"s11", "s12", "s13"), class = "factor"), Channel = c(1L, 1L), 
    Voltage = structure(c(-0.992322316444497, -0.992322316444497
    ), "`scaled:center`" = -6.23438399446429e-16, "`scaled:scale`" = 1), 
    Time = c(201.149466192171, 201.149466192171), Baseline = c(0.688151312347969, 
    0.688151312347969), id = 1:2), .Names = c("Trial", "Sample", 
"Class", "Subject", "Channel", "Voltage", "Time", "Baseline", 
"id"), class = c("data.table", "data.frame"), row.names = c(NA, 
-2L), sorted = "id")    

test.table
#    Trial Sample Class Subject Channel    Voltage     Time  Baseline id
# 1:  1169    155     1     s13       1 -0.9923223 201.1495 0.6881513  1
# 2:  1169    155     1     s13       1 -0.9923223 201.1495 0.6881513  2
dcast.data.table(test.table, 
                  Class + Time + Trial ~ Channel,
                  value.var = "Voltage",
                  fun.aggregate=identity)
# Error in dcast.data.table(test.table, Class + Time + Trial ~ Channel,  : 
#   retFirst must be integer vector the same length as nrow(i)

Changing the value of a single column used in the dcast formula gets close to the output I'm looking for:

test.table[2,Trial:=1170]
dcast.data.table(test.table, 
                  Class + Time + Trial ~ Channel,
                  value.var = "Voltage",
                  fun.aggregate=identity)
#    Class     Time Trial          1
# 1:     1 201.1495  1169 -0.9923223
# 2:     1 201.1495  1170 -0.9923223

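What is bothering data.table? Since I don't understand the error, I tried changing the key and shuffling the order of the formula terms just to see, but that didn't help.

If I replace the function call with the regular dcast from reshape2, I get a seemingly unrelated error: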

# Error in vapply(indices, fun, .default) : values must be length 0, but FUN(X[[29]]) result is length 1

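At this point in my code I don't care whether the Trial values are correct, so I can work around the problem by replacing Trial with id in the formula, but I'm interested in a more general or more robust solution.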

Best answer

Update: Fixed in commit 1253 of v1.9.3. From NEWS:

dcast.data.table provides better error message when fun.aggregate is specified but it returns length != 1. Closes git #693. Thanks to Trevor Alexander for reporting here on SO.

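I agree that the error message should do more to explain the problem, as messages in data.table normally do. This is just a case I did not foresee.

If you can, please file an issue here as a bug, and I'll fix it when I have time.

However, your problem looks to me like a trivial case of RTFM. From ?dcast.data.table: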

fun.aggregate - Should the data be aggregated before casting? If the formula doesn't identify single observation for each cell, then aggregation defaults to length with a message.

In the DETAILS section: "... fun.aggregate will have to be used. The aggregating function should take a vector as input and return a single value (or a list of length one) as output." ...

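In your example, the formula's LHS yields two identical rows, which means fun.aggregate has to be used. If you don't supply one, it defaults to length (just as reshape2:::dcast does). You supplied identity, which simply returns its input unchanged, so it returned both Voltage values for that group, and the function does not accept that.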
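To make the length-1 requirement concrete, here is a minimal sketch that counts how many rows fall into each output cell and compares what identity and a collapsing function return for it. It assumes a recent data.table in which the generic dcast() dispatches to the data.table method. The table is rebuilt with only the columns the formula uses, and mean is an assumed stand-in for whatever aggregate actually suits the data.

```r
library(data.table)

# Rebuild only the columns the formula uses, with the values from the question.
test.table <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.149466192171, 201.149466192171),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.992322316444497, -0.992322316444497)
)

# For each output cell (LHS columns plus Channel), count the rows and
# record the length each candidate aggregating function returns.
cells <- test.table[, .(n_rows       = .N,
                        identity_len = length(identity(Voltage)),
                        mean_len     = length(mean(Voltage))),
                    by = .(Class, Time, Trial, Channel)]
print(cells)

# mean (an assumed choice) collapses each cell to one value, so the cast succeeds.
wide <- dcast(test.table, Class + Time + Trial ~ Channel,
              value.var = "Voltage", fun.aggregate = mean)
print(wide)
```

Any cell with n_rows greater than 1 needs a function that reduces its values to a single one. Whether averaging is appropriate depends on what the duplicated rows mean in the data.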

The error message should read something like:

Error: fun.aggregate should return, for each unique group (from formula's LHS), a length 1 vector, but returns length=2 for a group.

Or something along those lines. Feel free to suggest a better or clearer error message.

PS: I don't understand what you mean by "near-duplicate".

identical(test.table[1, list(Class, Time, Trial)], 
          test.table[2, list(Class, Time, Trial)])
# [1] TRUE

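If you add the id column to the formula (on the RHS in the call below), you should be able to get the desired result, because each value can now be uniquely identified: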

dcast.data.table(test.table, 
             Class + Time + Trial ~ Channel + id,
             value.var = "Voltage",
             fun.aggregate=identity)

#    Class     Time Trial        1_1        1_2
# 1:     1 201.1495  1169 -0.9923223 -0.9923223

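The function only looks at the columns given in the formula to decide whether the rows are unique, not at whether the actual input data has unique rows (in case that is the source of the confusion).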
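The distinction can be checked directly before casting. The sketch below is only an illustration: it rebuilds a subset of the question's columns and compares row uniqueness across all columns with uniqueness across the columns named in the formula. The variable names n_full, n_formula and formula_cols are made up for this example.

```r
library(data.table)

# A subset of the question's columns, including id.
test.table <- data.table(
  Trial   = c(1169L, 1169L),
  Class   = c(1L, 1L),
  Channel = c(1L, 1L),
  Voltage = c(-0.992322316444497, -0.992322316444497),
  Time    = c(201.149466192171, 201.149466192171),
  id      = 1:2
)

# Across all columns the two rows differ, because id differs.
n_full <- nrow(unique(test.table, by = names(test.table)))

# Restricted to the columns named in the formula, they collapse to one row.
formula_cols <- c("Class", "Time", "Trial", "Channel")
n_formula <- nrow(unique(test.table, by = formula_cols))

print(c(all_columns = n_full, formula_columns = n_formula))
```

If the second count is smaller than the number of rows in the table, at least one cell receives more than one value and an aggregating function is required.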

In response to the OP's second comment:

Currently, the only way to get a result (without an error) is for your function to return a list:

dcast.data.table(test.table, 
             Class + Time + Trial ~ Channel,
             value.var = "Voltage",
             fun.aggregate=list)
#    Class     Time Trial                     1
# 1:     1 201.1495  1169 -0.9923223,-0.9923223

You can then check whether the elements of the column all have length 1 and, if so, unlist it.
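That check might look like the sketch below. flatten_if_scalar is a hypothetical helper written for this illustration, not part of data.table. The sketch also assumes a recent data.table in which dcast() dispatches to the data.table method, and R 3.2.0 or later for lengths().

```r
library(data.table)

# Rebuild only the columns the formula uses, with the values from the question.
test.table <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.149466192171, 201.149466192171),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.992322316444497, -0.992322316444497)
)

# Cast with list so that every cell keeps all of its values.
wide <- dcast(test.table, Class + Time + Trial ~ Channel,
              value.var = "Voltage", fun.aggregate = list)

# Hypothetical helper: replace a list column by a plain vector,
# but only when every cell holds exactly one value.
flatten_if_scalar <- function(dt, col) {
  if (all(lengths(dt[[col]]) == 1L)) {
    set(dt, j = col, value = unlist(dt[[col]]))
  }
  invisible(dt)
}

flatten_if_scalar(wide, "1")

# The column stays a list here, because the single cell holds two values.
print(is.list(wide[["1"]]))
```

When the column does stay a list, as with the duplicated rows here, that is the signal that some cells still need a real aggregate.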

Regarding "r - Undocumented error in dcast.data.table", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24152733/
