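I need to remove duplicates from a large data frame with 100 million rows, and I am testing whether data.table can help. However, in the code below, unique() on a data.table does not produce the same result as unique() on a data.frame. Is there possibly a bug in data.table's setkey?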
library(data.table)
tmp <- data.frame(id=c(1000000128152, 1000000228976, 1000000235508, 1000000294933, 1000000311288, 1000000353770, 1000000441585, 1000000466482, 1000000473521,
1000000491353, 1000000497787, 1000000534948, 1000000589071, 1000000622890, 1000000658287, 1000000695865, 1000000731674, 1000000780659,
1000000818218, 1000000834389, 1000000877189, 1000000937770, 1000000937770, 1000000996135, 1000001061831, 1000001062057, 1000001065241,
1000001097542, 1000001122242, 1000001177167, 1000001194078, 1000001216323, 1000001232155, 1000001294998, 1000001361126, 1000001361126,
1000001389830, 1000001411284, 1000001415793, 1000001417557, 1000001485326, 1000001565513, 1000001624601, 1000001650282, 1000001681805,
1000001683548, 1000001683548, 1000001693445, 1000001693455, 1000001693462, 1000001693466, 1000001693490, 1000001693490, 1000001703493,
1000001703511, 1000001703518, 1000001703546, 1000001703554, 1000001703613, 1000001703644))
unique(tmp$id)
DT <- data.table(tmp)
setkey(DT, id)
DTU <- unique(DT)
DTU$id
Result from unique(tmp$id):
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693466
[49] 1000001693490 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
Result from DTU$id:
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693490
[49] 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
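Comparing the two, we can see that 1000001693466 was wrongly dropped from DTU. Any suggestions as to why? I suspect setkey, because when I subtract 1000000000000 from all the numbers, the two results are the same.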
Best answer
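Edit (from Arun): The default rounding feature has been removed in the current development version of data.table, v1.9.7, and will most likely stay that way going forward. See here for the installation instructions.
This also means that you are responsible for understanding the limitations of representing floating point numbers and for dealing with them :-).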
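As a small illustration of the floating point limitation mentioned above, here is a base R sketch. The numbers are illustrative only and do not come from the question; the point is simply that two doubles which "should" be equal can differ in their last bits, which is the kind of surprise the old default rounding was meant to hide.

```r
# Base R only; values chosen for illustration, not taken from the question.
x <- 0.1 + 0.2

exact_equal <- (x == 0.3)                 # bit-for-bit comparison of doubles
near_equal  <- isTRUE(all.equal(x, 0.3))  # comparison within a small tolerance

print(c(exact = exact_equal, near = near_equal))
```

With rounding switched off, grouping and de-duplication compare doubles exactly, so differences like this one are no longer smoothed over.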
help(setkey) says (data.table version 1.9.6):
Note that columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. Have a look at setNumericRounding to learn more.
With the rounding changed to 1 byte before setting the key:
DT <- data.table(tmp)
setNumericRounding(1) # set rounding
setkey(DT, id)
the value is no longer dropped.
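To see why these particular ids collide, here is a rough back-of-the-envelope sketch in base R. It assumes a simplified model in which rounding off k bytes keeps 52 - 8*k mantissa bits and rounds to the nearest representable value. The helper `spacing()` is hypothetical and is not part of data.table, whose internal implementation may differ in detail.

```r
# Simplified model (an assumption, not data.table's actual code):
# after dropping `bytes` bytes of the mantissa, values near x can only be
# told apart if they differ by roughly this much.
spacing <- function(x, bytes) 2^(floor(log2(x)) - 52 + 8 * bytes)

a <- 1000001693462
b <- 1000001693466   # the id that went missing in the question

g2 <- spacing(a, 2)  # spacing with the old default of 2 bytes
g1 <- spacing(a, 1)  # spacing after setNumericRounding(1)

# Do the two ids land on the same grid point under each setting?
collide2 <- round(a / g2) == round(b / g2)
collide1 <- round(a / g1) == round(b / g1)

# The same ids after subtracting 1e12, as tried in the question
g_small <- spacing(a - 1e12, 2)

print(c(g2 = g2, g1 = g1, g_small = g_small))
print(c(collide2 = collide2, collide1 = collide1))
```

Under this model the spacing near 1e12 with 2-byte rounding is larger than the gap of 4 between the two ids, whereas with 1-byte rounding, or for the much smaller numbers obtained by subtracting 1000000000000, the spacing is well below 1 and whole numbers stay distinct. That is consistent with what the question observed.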
However, help(setNumericRounding) says:
For large numbers (integers > 2^31), we recommend using bit64::integer64 rather than setting rounding to 0.
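A minimal sketch of that recommendation, assuming the bit64 package is installed and using only a handful of the ids from the question. The names `DT64` and `DTU64` are made up for this example. Converting from double with as.integer64() is safe here because the ids are whole numbers well below 2^53; ids read from a file would be better read as integer64 or character in the first place.

```r
library(data.table)
library(bit64)

# A few ids from the question, including the pair that collided
# and one genuine duplicate.
ids <- c(1000001693462, 1000001693466, 1000001693490, 1000001693490)

# Store the ids as 64-bit integers instead of doubles.
DT64 <- data.table(id = as.integer64(ids))
setkey(DT64, id)

DTU64 <- unique(DT64)
n_unique <- nrow(DTU64)

print(DTU64)
```

Because integer64 values are compared as integers, no floating point rounding is involved, and only the true duplicate should be removed.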
Regarding "r - Unique in data.table wrongly drops some values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38668751/