r - Merging data.tables when the number of key columns differs

Tags: r merge data.table

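I am trying to understand the join logic in data.table from the documentation, and I am a bit unclear about it. I know I could just try it and see what happens, but I want to make sure there is no pathological case, so I would like to know how the logic is actually coded. When two data.table objects have a different number of key columns, for example a has 2 and b has 3, and you run c <- a[b], are a and b merged simply on the first two key columns, or is the third column of a automatically merged to the third key column of b? An example: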

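library(data.table)
# id and t are recycled to 40 rows; the key of a has two columns
a <- data.table(id=1:10, t=1:20, v=1:40, key=c("id", "t"))
# id is recycled to 20 rows; the key of b has one column
b <- data.table(id=1:10, v2=1:20, key="id")
c <- a[b]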

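This should select the rows of a that match the id key column of b. For example, for id==1 there are 2 rows in b and 4 rows in a, which should produce 8 rows in c. That is indeed what happens: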
> head(c,10)
    id  t  v v2
 1:  1  1  1  1
 2:  1  1 21  1
 3:  1 11 11  1
 4:  1 11 31  1
 5:  1  1  1 11
 6:  1  1 21 11
 7:  1 11 11 11
 8:  1 11 31 11
 9:  2  2  2  2
10:  2  2 22  2
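
The row counts claimed above can be checked directly. The following is a minimal sketch, not part of the original question: it rebuilds the same tables with the recycling written out via rep(), stores the join in res (a name chosen here so that base c() is not shadowed), and passes allow.cartesian = TRUE on the assumption that a recent data.table release is installed, since recent releases refuse a join this large by default.

```r
library(data.table)

# Same data as in the question, with the recycling made explicit.
a <- data.table(id = rep(1:10, 4), t = rep(1:20, 2), v = 1:40,
                key = c("id", "t"))
b <- data.table(id = rep(1:10, 2), v2 = 1:20, key = "id")

# Assumption: a recent data.table version, where a many-to-many join
# of this size needs allow.cartesian = TRUE to run.
res <- a[b, allow.cartesian = TRUE]

n_a   <- a[id == 1, .N]    # rows of a with id 1
n_b   <- b[id == 1, .N]    # rows of b with id 1
n_res <- res[id == 1, .N]  # rows of the join with id 1

print(c(a = n_a, b = n_b, joined = n_res))  # 4, 2 and 8
```

Each of the 2 rows of b with a given id is paired with all 4 rows of a that share that id, which is why the count is the product 2 * 4.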

Another way to try it is:
d <- b[a]

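This should do the same thing: for each row of a, it should select the matching rows of b. Since a has an extra key column, t, that column should not be used for matching, and the join should be based only on the first key column, id. This seems to be the case: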
> head(d,10)
    id v2  t  v
 1:  1  1  1  1
 2:  1 11  1  1
 3:  1  1  1 21
 4:  1 11  1 21
 5:  1  1 11 11
 6:  1 11 11 11
 7:  1  1 11 31
 8:  1 11 11 31
 9:  2  2  2  2
10:  2 12  2  2
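
One way to check that a[b] and b[a] pair up the same rows, differing only in row and column order, is sketched below. The helper row_id is hypothetical and introduced only for this illustration, and allow.cartesian = TRUE assumes a recent data.table release.

```r
library(data.table)

a <- data.table(id = rep(1:10, 4), t = rep(1:20, 2), v = 1:40,
                key = c("id", "t"))
b <- data.table(id = rep(1:10, 2), v2 = 1:20, key = "id")

ab <- a[b, allow.cartesian = TRUE]  # columns: id, t, v, v2
ba <- b[a, allow.cartesian = TRUE]  # columns: id, v2, t, v

# Hypothetical helper: one string per row, in a fixed column order,
# sorted so that the comparison ignores the order of the rows.
row_id <- function(dt) sort(paste(dt$id, dt$t, dt$v, dt$v2))

same_rows <- identical(row_id(ab), row_id(ba))
print(same_rows)  # TRUE
print(nrow(ba))   # 80
```

Both results have 80 rows, 8 for each of the 10 ids, which is what a join on id alone would give; t plays no part in the matching because b has no second key column to match it against.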

Can someone confirm? To be clear: is the third key column of a ever used in any of the merges, or does data.table only use min(length(key(DT))) of the two tables?

Best answer

Good question. First, the correct terminology is (from ?data.table):

[A data.table] may have one key of one or more columns. This key can be used for row indexing instead of rownames.



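So it is "key" (singular), not "keys" (plural). We can get away with "keys" for now, but when secondary keys are added in the future, there may be multiple keys. Each key (singular) can have multiple columns (plural).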
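
As a small illustration of that terminology, here is a sketch using the table a from the question. key() and haskey() are existing data.table functions; the variable name key_cols is just for this example.

```r
library(data.table)

a <- data.table(id = rep(1:10, 4), t = rep(1:20, 2), v = 1:40,
                key = c("id", "t"))

key_cols <- key(a)        # the single key of a, as a character vector
print(key_cols)           # "id" "t"
print(length(key_cols))   # 2: one key made of two columns
print(haskey(a))          # TRUE
```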

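Otherwise, you are absolutely on the right track. The following paragraph was improved in v1.8.2, based on feedback from others who were also confused. From ?data.table: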

When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then many rows of x will ordinarily match to each row of i since not all of x's key columns will be joined to (a common use case). If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 to column 2, and so on) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant. The columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key.



Following the comments, in v1.8.3 (on R-Forge) it now reads as follows (the changes were marked in bold in the original):

When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then not all of x's key columns will be joined to (a common use case) and many rows of x will (ordinarily) match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 of i's key to column 2 of x's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key. In code, the number of join columns is determined by min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)).
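
The closing formula of that paragraph can be wrapped in a small function to see what it gives for the tables in the question. This is only a sketch of the documented rule as quoted above: n_join_cols is a hypothetical helper, not a data.table function, and u is an unkeyed table invented for the illustration.

```r
library(data.table)

# Hypothetical helper: the number of join columns for x[i], following
# the formula quoted from ?data.table (v1.8.3) above.
n_join_cols <- function(x, i) {
  min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))
}

a <- data.table(id = rep(1:10, 4), t = rep(1:20, 2), v = 1:40,
                key = c("id", "t"))
b <- data.table(id = rep(1:10, 2), v2 = 1:20, key = "id")
u <- data.table(id = 1:3, t = 1:3, extra = 4:6)  # unkeyed, 3 columns

n_ab <- n_join_cols(a, b)  # min(2, 1): a[b] joins on id only
n_ba <- n_join_cols(b, a)  # min(1, 2): b[a] joins on id only
n_au <- n_join_cols(a, u)  # min(2, 3): a[u] would join on two columns

print(c(ab = n_ab, ba = n_ba, au = n_au))  # 1, 1 and 2
```

For the question's example this gives one join column in both directions, which agrees with the outputs of a[b] and b[a] shown earlier.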

Regarding "r - Merging data.tables when the number of key columns differs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12920803/
