r - data.table join and j-expression unexpected behavior

In R 2.15.0 and data.table 1.8.9:

library(data.table)
d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#   a value
#   3     4

d[J(3)][, value]
#   4

I'd expect both of these to produce the same output (the second one), and I believe they should.

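To make clear that this is not a J syntax issue, the same expectation applies to the following expressions (equivalent to the ones above):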

t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]

I'd expect both of the above to return exactly the same output.

So let me rephrase the question: why is data.table designed so that the key column is printed automatically in d[t, value]?

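Update (based on the answers and comments below): Thanks @Arun et al., I understand the design reasoning now. The key is printed above because there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. The reason for this design seems to be the following: since a by operation has to be performed when merging anyway, one might as well take advantage of it and not do another by if you are going to group by the key of the merge.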

That said, I believe this is a syntax design flaw. The way I read the data.table syntax d[i, j, by = b] is

take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b

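by-without-by breaks this reading and introduces cases that have to be thought about specially (am I merging on i, is the by just the key of the merge, and so on). I believe handling this should be data.table's job: the commendable effort to make data.table faster in one particular case of merging, when the by equals the key, should be achieved some other way (for example, by checking internally whether the by expression is actually the key of the merge).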
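For readers on newer releases: the behaviour discussed here is specific to older versions such as the 1.8.9 used above. As far as I know, from data.table 1.9.4 onward the implicit grouping is no longer the default and has to be requested with by = .EACHI. A minimal sketch, assuming a recent data.table is installed (the names res_eachi and res_chain are only for illustration):

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# Explicitly group by each row of i: the join column is kept next to j,
# which mirrors the old by-without-by result from the question.
res_eachi <- d[J(3), value, by = .EACHI]

# Join first, then evaluate j on the joined table: a plain vector.
res_chain <- d[J(3)][, value]

print(res_eachi)
print(res_chain)
```

With the grouping opt-in, d[i, j, by = b] can again be read as "apply i, then do j by b".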

Best answer

Edit number infinity: FAQ 1.12 answers your question exactly (FAQ 1.13 is also useful and relevant, but not pasted here).

1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
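To make the FAQ's point concrete, here is a small sketch comparing the two routes. The tables X and Y and the columns id, foo and bar are made up for illustration, and the results assume data.table 1.9.4 or later, where j is evaluated once over the whole join unless by = .EACHI is given (in 1.8.9 the first call would be grouped by each row of Y instead):

```r
library(data.table)

# Hypothetical tables: foo lives in X, bar lives in Y
X <- data.table(id = c(1L, 2L, 3L), foo = c(10, 20, 30), key = "id")
Y <- data.table(id = c(1L, 3L), bar = c(2, 3), key = "id")

# Join and compute in one step; only foo and bar are needed by j
joined_sum <- X[Y, sum(foo * bar)]

# Same result via a full merge followed by a computation
merged_sum <- merge(X, Y)[, sum(foo * bar)]

# One result per row of Y, keeping the join column alongside
per_row <- X[Y, sum(foo * bar), by = .EACHI]

print(joined_sum)
print(merged_sum)
print(per_row)
```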

Old answer, which did not answer the OP's question (according to the OP's comment); retained here because I believe it does.

When you give a value for j such as d[, 4] or d[, value] in data.table, j is evaluated as an expression. From data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ):

Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.

So the first thing to understand is that, in your case:

d[, value] # produces a "vector"
# [1] 2 3 4 5 6

This is no different when i is a basic index, as in:

d[3, value] # produces a vector of length 1
# [1] 4
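A short sketch of what "j is evaluated as an expression" means in practice, using the same d as in the question (the variable names v, dt and s are only for illustration):

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

v  <- d[, value]        # bare column name: evaluates to a plain vector
dt <- d[, list(value)]  # wrapped in list(): a one-column data.table
s  <- d[, sum(value)]   # any R expression over the columns works

print(v)
print(dt)
print(s)
```

Whether the result is a vector or a data.table therefore depends on what the j expression itself returns.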

However, things are different when i is itself a data.table. From the data.table introduction (page 6):

d[J(3)] # is equivalent to d[data.table(a = 3)]

Here, you are performing a join. If you just do d[J(3)], you get all the columns corresponding to that join. If you do,

d[J(3), value] # which is equivalent to d[J(3), list(value)]

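(since you say this answer does not answer your question, this is where I believe the answer to your "rephrased" question lies) then you get just that column, but because you are performing a join, the key column is output as well (it is a join between two tables based on the key column).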

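Edit: Following your second edit, if your question is "why so?", then I'd reluctantly (or rather, ignorantly) answer that Matthew Dowle designed it this way to differentiate between a data.table join-based subset and an index-based subsetting operation.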

Your second syntax is equivalent to:

d[J(3)][, value] # is equivalent to:

dd <- d[J(3)]
dd[, value]

Again, in dd[, value], j is evaluated as an expression, and therefore you get a vector.

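To answer your third modified question: for the third time, it is because this is a JOIN between two data.tables based on the key column. If I join two data.tables, I expect a data.table back.
From the data.table introduction, once again: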

Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
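For anyone unfamiliar with the base R idiom the quote refers to, here is a minimal sketch of indexing a matrix by a 2-column matrix (the matrices A and B are made up for illustration):

```r
A <- matrix(1:12, nrow = 3)    # 3 x 4 matrix, filled column-wise
B <- cbind(c(1, 3), c(2, 4))   # each row of B is a (row, column) pair

picked <- A[B]                 # looks up A[1, 2] and A[3, 4]
print(picked)
```

Each row of B acts as a lookup into A, much like each row of i is looked up in the keyed table in d[i].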

Regarding r - data.table join and j-expression unexpected behavior, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16093289/
