r - unique() in data.table incorrectly drops some values

Tags: r data.table unique

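I need to remove duplicates from a large data frame with 100 million rows, and I am testing whether data.table can help. However, in the code below, unique() on a data.table does not produce the same result as unique() on the data.frame column. Is there a possible bug in data.table's setkey?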

library(data.table)
tmp <- data.frame(id=c(1000000128152, 1000000228976, 1000000235508, 1000000294933, 1000000311288, 1000000353770, 1000000441585, 1000000466482, 1000000473521, 
                         1000000491353, 1000000497787, 1000000534948, 1000000589071, 1000000622890, 1000000658287, 1000000695865, 1000000731674, 1000000780659, 
                         1000000818218, 1000000834389, 1000000877189, 1000000937770, 1000000937770, 1000000996135, 1000001061831, 1000001062057, 1000001065241, 
                         1000001097542, 1000001122242, 1000001177167, 1000001194078, 1000001216323, 1000001232155, 1000001294998, 1000001361126, 1000001361126, 
                         1000001389830, 1000001411284, 1000001415793, 1000001417557, 1000001485326, 1000001565513, 1000001624601, 1000001650282, 1000001681805, 
                         1000001683548, 1000001683548, 1000001693445, 1000001693455, 1000001693462, 1000001693466, 1000001693490, 1000001693490, 1000001703493, 
                         1000001703511, 1000001703518, 1000001703546, 1000001703554, 1000001703613, 1000001703644))
unique(tmp$id)
DT <- data.table(tmp)
setkey(DT, id)
DTU <- unique(DT)
DTU$id

Result from unique(tmp$id):
 [1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693466
[49] 1000001693490 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644

Result from DTU$id:
 [1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693490
[49] 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644

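Comparing the two, 1000001693466 has been wrongly dropped from DTU. Any suggestions as to why? I suspect setkey, because when I subtract 1000000000000 from all the numbers, the two results are the same.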

Best answer

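Edit (from Arun): the default rounding feature has been removed in the current development version of data.table, v1.9.7, and will most likely stay that way going forward. See the installation instructions for the development version.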

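This also means that you are responsible for understanding the limitations of representing floating point numbers, and for dealing with them :-).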
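As a small base-R illustration of those limits (an addition, not part of the original answer): doubles carry a 53-bit significand, so every integer up to 2^53 is stored exactly, which includes the ids in the question. Above that, neighbouring integers collapse, and decimal fractions are usually inexact at any magnitude. The variable names are only illustrative.

```r
# The ids in the question are far below 2^53, so each is stored exactly as a double.
id_a <- 1000001693462
id_b <- 1000001693466
exact <- id_a < 2^53 && id_b < 2^53 && (id_b - id_a) == 4

# Above 2^53, adjacent integers can no longer be told apart.
collapsed <- (2^53 + 1) == 2^53

# Decimal fractions are usually inexact, so compare them with a tolerance.
strict_equal <- (0.1 + 0.2) == 0.3
tolerant_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))

print(c(exact = exact, collapsed = collapsed,
        strict_equal = strict_equal, tolerant_equal = tolerant_equal))
```

So the ids themselves are represented exactly; any merging of distinct ids has to come from extra rounding applied on top of the double representation.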


help(setkey) says (data.table version 1.9.6):

Note that columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. Have a look at setNumericRounding to learn more.
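To see why this rounding merges two of the ids, here is a rough base-R model. It assumes that rounding off the last two bytes behaves like rounding each value to the nearest multiple of 2^16 ulps; that is a simplification for illustration, not data.table's actual implementation. For values between 2^39 and 2^40, one ulp is 2^-13, so the grid spacing works out to 8.

```r
# Four ids from the question; under 1.9.6 defaults the middle two were merged.
ids <- c(1000001693455, 1000001693462, 1000001693466, 1000001693490)

# Spacing between adjacent doubles (one ulp) for values in [2^39, 2^40).
ulp <- 2^(floor(log2(ids[1])) - 52)

# Ignoring the last two bytes (16 bits) of the significand leaves a coarser grid.
grid <- ulp * 2^16

# Assumed model: snap every value to the nearest grid point.
rounded <- round(ids / grid) * grid

print(grid)
print(format(rounded, scientific = FALSE))
print(length(unique(rounded)))
```

Under this model 1000001693462 and 1000001693466 both land on 1000001693464, so only three distinct values remain. With one byte of rounding the grid shrinks to 2^-5, which is finer than the spacing between integers, so the ids stay distinct.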

If you change the rounding to 1 byte before setting the key

DT <- data.table(tmp)
setNumericRounding(1)   # set rounding
setkey(DT, id) 

the value is no longer dropped.
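A quick way to check this on a handful of the ids is sketched below. It assumes data.table is installed, and it saves and restores the session's rounding setting with getNumericRounding(), since the setting is global; the variable names are illustrative.

```r
library(data.table)

# A few ids from the question; 1000001693490 appears twice on purpose.
ids <- c(1000001693462, 1000001693466, 1000001693490, 1000001693490)

old_rounding <- getNumericRounding()  # remember the session's current setting
setNumericRounding(1)                 # round off only the last byte

DT <- data.table(id = ids)
setkey(DT, id)
n_distinct_ids <- nrow(unique(DT))    # should agree with length(unique(ids))

setNumericRounding(old_rounding)      # restore the previous setting
print(n_distinct_ids)
```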

However, help(setNumericRounding) says:

For large numbers (integers > 2^31), we recommend using bit64::integer64 rather than setting rounding to 0.
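A minimal sketch of that recommendation, assuming both data.table and bit64 are installed. The ids are parsed from character strings here so that they never pass through a double; whether that is practical depends on how the real data is read in.

```r
library(data.table)
library(bit64)

# Parse ids from strings so they are stored as 64-bit integers, not doubles.
ids <- as.integer64(c("1000001693462", "1000001693466",
                      "1000001693490", "1000001693490"))

DT <- data.table(id = ids)
setkey(DT, id)
DTU <- unique(DT)   # integer comparison, so no floating point rounding applies
print(DTU$id)
```

Because integer64 values are compared as integers, the numeric rounding setting does not affect them.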

Source: the corresponding question on Stack Overflow: https://stackoverflow.com/questions/38668751/
