r - Data.table, logical comparison and encoding bug/error in a non-English environment

Tags: r encoding data.table

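data.table gives a warning even when the encodings are known and not mixed. The only time the merge gives no warning is when both columns have their encoding marked as unknown. This does not seem right, and logical comparison seems to behave differently and ignore the encoding.

I have two questions. First, why does data.table behave like this when both encodings are known and identical? Judging by the warning text, I am guessing this is a bug (although a small one)?

The last merge, which fails, might be the desired behaviour, but shouldn't the logical comparison fail as well? That brings me to my second question: what is the difference between a data.table join and a logical comparison, given that they give different results in my last merge?

Logical comparison seems more robust in the face of encoding issues.

Code and reproducible output follow. The sessionInfo() output is at the bottom.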

library("data.table")

d.tst <- data.table(Nr = c("ÅÄÖ", "ÄÖR"))
d.tst2 <- data.table(Nr2 = c("ÅÄÖ", "ÄÖR"),
                     Dat = c(1, 2))

Encoding(d.tst$Nr)
# [1] "latin1" "latin1"
Encoding(d.tst2$Nr2)
# [1] "latin1" "latin1"

d.tst[1]$Nr == d.tst2[1]$Nr2
# [1] TRUE
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown
encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

d.tst$Nr <- iconv(d.tst$Nr, "LATIN1", "UTF-8")
d.tst2$Nr2 <- iconv(d.tst2$Nr2, "LATIN1", "UTF-8")

Encoding(d.tst$Nr)
# [1] "UTF-8" "UTF-8"
Encoding(d.tst2$Nr2)
# [1] "UTF-8" "UTF-8"

a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,: A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown
encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

d.tst$Nr <- iconv(d.tst$Nr, "UTF-8", "cp1252")
d.tst2$Nr2 <- iconv(d.tst2$Nr2, "UTF-8", "cp1252")

Encoding(d.tst$Nr)
# [1] "unknown" "unknown"
Encoding(d.tst2$Nr2)
# [1] "unknown" "unknown"

a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

# Here we change the encoding on only one data.table

d.tst$Nr <- iconv(d.tst$Nr, "cp1252", "UTF-8")

#Check encoding
Encoding(d.tst$Nr)
# [1] "UTF-8" "UTF-8"
Encoding(d.tst2$Nr2)
# [1] "unknown" "unknown"

# Logical comparison
d.tst[1]$Nr == d.tst2[1]$Nr2
# [1] TRUE

# This merge fails completely, not just a warning, even if logic says they are the same
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown
encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

sessionInfo() 

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Swedish_Sweden.1252  LC_CTYPE=Swedish_Sweden.1252    LC_MONETARY=Swedish_Sweden.1252 LC_NUMERIC=C                    
[5] LC_TIME=Swedish_Sweden.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 RODBC_1.3-13    

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.2       assertthat_0.1 DBI_0.4-1      tools_3.3.1    tibble_1.1     Rcpp_0.12.5    chron_2.3-47

Best answer

As of the new data.table version 1.9.8, this should be fixed.

For example:

# This merge fails completely, not just a warning, even if logic says they are the same
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

The code above failed for me in 1.9.6 (given my system settings). As of 1.9.8 it works as expected.
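For background on why `==` and the 1.9.6 join disagreed in the question: the warning text itself says data.table "compares the bytes currently", while base R's `==` (as far as I know, per `?Comparison`) translates strings with different marked encodings to a common encoding before comparing. The sketch below is only an illustration of that difference, not data.table's internal code. It builds the same text in two encodings with `\u` escapes so it does not depend on the encoding of the script file; the variable names are made up for this example.

```r
# Same text "ÅÄÖ", built with Unicode escapes so the result is marked UTF-8.
x_utf8 <- "\u00c5\u00c4\u00d6"

# Re-encode to latin1; iconv() marks the result as latin1.
x_latin1 <- iconv(x_utf8, "UTF-8", "latin1")

Encoding(x_utf8)
Encoding(x_latin1)

# Logical comparison looks at the text, not the stored bytes.
x_utf8 == x_latin1

# The underlying bytes differ, which is what a byte-wise comparison sees.
charToRaw(x_utf8)    # two bytes per character in UTF-8
charToRaw(x_latin1)  # one byte per character in latin1
```

So a join that compares raw bytes can miss a match that `==` reports as equal, which is consistent with the last merge in the question.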

So this should now be resolved.
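If upgrading is not an option, one possible workaround is to bring both join columns to a single encoding before merging. This is a minimal sketch under my own assumptions, not something taken from the data.table documentation: `to_utf8()` and `same_bytes()` are hypothetical helpers defined here, and the data.table part only runs if the package is installed.

```r
# Hypothetical helper (not part of data.table): re-encode to UTF-8.
to_utf8 <- function(x) enc2utf8(as.character(x))

# Hypothetical helper: TRUE if two character vectors are byte-for-byte equal.
same_bytes <- function(a, b) {
  identical(lapply(a, charToRaw), lapply(b, charToRaw))
}

# The question's keys, once marked UTF-8 and once marked latin1.
nr_utf8   <- c("\u00c5\u00c4\u00d6", "\u00c4\u00d6R")
nr_latin1 <- iconv(nr_utf8, "UTF-8", "latin1")

same_bytes(nr_utf8, nr_latin1)                    # bytes differ before normalizing
same_bytes(to_utf8(nr_utf8), to_utf8(nr_latin1))  # bytes agree afterwards

# Optional: the merge from the question on normalized keys.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  d.tst  <- data.table(Nr = to_utf8(nr_utf8))
  d.tst2 <- data.table(Nr2 = to_utf8(nr_latin1), Dat = c(1, 2))
  a <- merge(d.tst, d.tst2, all.x = TRUE, by.x = "Nr", by.y = "Nr2")
  print(a)
}
```

Note that `enc2utf8()` treats strings marked "unknown" as being in the native locale encoding, so this only helps when that assumption holds, as it does for the cp1252 strings in the question's Swedish_Sweden.1252 locale.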

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/39633211/
