R read.table: too many items

I have a 53 GB file. Here is its head:

1   10  2873
1   100 22246
1   1000    28474
1   10000   35663
1   10001   35755
1   10002   35944
1   10003   36387
1   10004   36453
1   10005   36758
1   10006   37240

I am running R 3.3.2 on a 64-bit CentOS 7 server with 128 GB of RAM. I have already read 4098 similar files into R. However, I cannot read the largest one into R.

df <- read.table(f, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items

It fails with the error "too many items". I then followed this tip:

df5rows <- read.table(f, nrows=5, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
classes <- sapply(df5rows, class)
df <- read.table(f, nrows=3231959401, colClass=classes, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')

It still says "too many items", and additionally "NAs introduced". I also tried without colClasses, with the same result:

df <- read.table(f, nrows=3231959401, header=FALSE, col.names=c('a', 'b', 'dist'), sep='\t', quote='', comment.char='')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  too many items
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  NAs introduced by coercion to integer range

Memory usage never exceeded 90 GB when no nrows or colClasses was given; with those arguments it never exceeded 60 GB. I don't understand why R cannot read the file.
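
One detail worth noting about the warning above: the nrows value passed, 3231959401, is larger than the biggest value R's integer type can hold (.Machine$integer.max, which is 2147483647). That matches the "coercion to integer range" message. The following is a minimal sketch of that check only; it shows the integer limit and does not by itself prove that this is what triggers "too many items".

```r
# The row count passed as nrows in the question.
n_rows <- 3231959401

# Largest value representable by R's integer type (2^31 - 1).
int_max <- .Machine$integer.max

# Does the requested row count exceed the integer range?
exceeds_limit <- n_rows > int_max

# Coercing it to integer yields NA (the warning is suppressed here).
coerced <- suppressWarnings(as.integer(n_rows))

print(exceeds_limit)
print(is.na(coerced))
```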

I have also checked that no line contains 4 or more columns.
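
For reference, a column-count check like this one can be sketched in R with count.fields(). The snippet below is only an illustration: it runs on a small temporary file (a stand-in, not the real 53 GB file) that deliberately contains one malformed line. For a file of the real size, a streaming command-line tool would probably be more practical.

```r
# Small stand-in file: three good lines and one with an extra field.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("1\t10\t2873",
             "1\t100\t22246",
             "1\t1000\t28474\t999",   # malformed: 4 fields
             "1\t10000\t35663"), tmp)

# Number of tab-separated fields on each line, using the same
# quote/comment settings as the read.table() call in the question.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Line numbers whose field count differs from the expected 3.
bad_lines <- which(n_fields != 3)
print(bad_lines)
```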

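Best answer

Have you tried cutting the file in two with a lightweight tool such as sed or vi? Then you only need to merge the two datasets. I ran into the same problem with large files on a very similar machine. The cause was a garbage line; with files of this size, these kinds of errors do happen.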
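
As a variation on cutting the file externally, the same idea can be sketched inside R by reading the file piece by piece from an open connection and combining the pieces afterwards. This is a rough sketch under assumptions, not a tested solution for the 53 GB file: read_in_chunks is a hypothetical helper name, the chunk size is arbitrary, and the demo runs on a small temporary file.

```r
# Hypothetical helper: read a tab-separated file piece by piece.
read_in_chunks <- function(path, chunk_rows, col_names) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    # readLines() continues from the connection's current position
    # and returns an empty vector once the end of file is reached.
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break
    # Parse only this piece, with the same settings as in the question.
    chunks[[length(chunks) + 1L]] <- read.table(
      text = lines, header = FALSE, col.names = col_names,
      sep = "\t", quote = "", comment.char = "")
  }
  chunks
}

# Small stand-in file with five rows.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("1\t10\t2873", "1\t100\t22246", "1\t1000\t28474",
             "1\t10000\t35663", "1\t10001\t35755"), tmp)

# Read two rows at a time, then merge the pieces.
pieces <- read_in_chunks(tmp, chunk_rows = 2, col_names = c("a", "b", "dist"))
df <- do.call(rbind, pieces)
print(dim(df))
```

Because each piece is parsed separately, a garbage line should make read.table() fail on one specific piece, which narrows down where it is. Whether all pieces of a file with billions of rows can be merged into a single data frame is a separate question that this sketch does not settle.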

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/42923816/
