r - fread(): reading table with \r\r\n as newline symbol

I have tab-delimited tables in text files in which every line ends with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it reports:

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

But I did not download these files; I already have them.

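So far, the only solution I have found is to read the file with read.table() first (it treats the \r\r\n combination as a single end-of-line marker) and then convert the resulting data.frame with data.table():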

mydt <- data.table(read.table(myfilename, header = TRUE, sep = '\t', fill = TRUE))

But I would like to know whether there is a way to avoid the slow read.table() and use the fast fread() instead.

Best answer

I suggest using the GNU utility tr to strip out the unnecessary \r characters. For example:

cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") : 
##  Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6
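The same tr filter can also be handed to fread() as a shell command, which avoids writing the intermediate test2.csv. This is only a sketch: it assumes a data.table version that has the cmd argument and that tr is on the PATH. The helper name tr_command is made up for illustration.

```r
# Hypothetical helper: build the shell command that strips \r from a file.
# shQuote() protects paths that contain spaces.
tr_command <- function(path) {
  sprintf("tr -d '\\r' < %s", shQuote(path))
}

# Write a small \r\r\n-terminated file in binary mode so that no
# extra line-ending translation happens on any platform.
src <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), src)

# Run only if both data.table and tr are available (assumption).
if (requireNamespace("data.table", quietly = TRUE) && nzchar(Sys.which("tr"))) {
  print(data.table::fread(cmd = tr_command(src)))
}
```

Passing the command through cmd rather than as the first argument makes the intent explicit.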

If you are on Windows and do not have the tr utility, you can get it here.
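If installing tr is not an option, the same byte-level clean-up can be sketched in base R. The helper strip_cr below is hypothetical (it is not part of data.table). It assumes the file fits in memory and that no field legitimately contains a \r, since every 0x0D byte is dropped.

```r
# Hypothetical helper: copy a file while dropping every carriage return (0x0D).
strip_cr <- function(infile, outfile) {
  size  <- file.info(infile)$size
  bytes <- readBin(infile, what = "raw", n = size)
  writeBin(bytes[bytes != as.raw(0x0D)], outfile)
  invisible(outfile)
}

# Small demonstration on a temporary file written in binary mode.
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), src)
strip_cr(src, dst)

# The cleaned file now has plain \n line endings and can go to fread().
if (requireNamespace("data.table", quietly = TRUE)) {
  print(data.table::fread(dst))
}
```

Working on raw bytes avoids any text-mode line-ending translation, which matters on Windows.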

Added:

I compared three methods on a 100,000 x 5 sample csv dataset.

OPcsv is the "slow" read.table method.
freadScan discards the extra \r characters in pure R.
freadtr has fread() call GNU tr directly through the shell.

The third method is by far the fastest.

# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
    sample.txt <- paste0(sample.txt,
                        paste(round(runif(5)*100), collapse = ","),
                        delim)
}
cat(sample.txt, file = "sample.csv")


# function that translates the extra \r characters in R only
fread2 <- function(filename) {
    tmp <- scan(file = filename, what = "character", quiet = TRUE)
    # remove empty lines caused by \r
    tmp <- tmp[tmp != ""]
    # paste the lines back together, separated by \n
    tmp <- paste(tmp, collapse = "\n")
    fread(tmp)
}

# OP's approach: read.table() followed by conversion to data.table (slow)
readcsvMethod <- function(myfilename)
    data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))

library(data.table)
library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d \'\\r\' < sample.csv"),
               unit = "relative")
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval
##          OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##      freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##        freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
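One caveat about fread2(): scan() with its default separator splits on any whitespace, so a header such as "a, b, c, d, e" is broken into separate tokens. A line-based variant avoids that. The sketch below was not part of the benchmark above. The helper name read_clean_lines is made up, and it assumes a data.table version whose fread() accepts the text argument. It also drops genuinely empty lines, which may or may not be what you want.

```r
# Hypothetical helper: read lines, strip stray trailing \r, drop empty lines.
read_clean_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- sub("\r+$", "", lines)   # remove any carriage returns left at line ends
  lines[nzchar(lines)]              # drop empty lines produced by the extra \r
}

# Demonstration file with spaces in the header, written in binary mode.
src <- tempfile(fileext = ".csv")
writeBin(charToRaw("a, b, c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), src)
clean <- read_clean_lines(src)

# Hand the cleaned lines to fread() if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(data.table::fread(text = clean))
}
```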

Regarding r - fread(): reading table with \r\r\n as newline symbol, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33339656/
