有没有办法在不导入文件的情况下获取文件中的行数?
到目前为止,这就是我正在做的
myfiles <- list.files(pattern="*.dat")
myfilesContent <- lapply(myfiles, read.delim, header=F, quote="\"")
for (i in 1:length(myfiles)){
test[[i]] <- length(myfilesContent[[i]]$V1)
}
但是太耗时了,因为每个文件都很大。
最佳答案
如果你:
system2("wc"…
的系统调用会引起inline
感到舒适包裹那么以下应该尽可能快地获得(它几乎是内联 R C 函数中
wc
的“行数”部分):library(inline)
wc.code <- "
uintmax_t linect = 0;
uintmax_t tlinect = 0;
int fd, len;
u_char *p;
struct statfs fsb;
static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;
PROTECT(f = AS_CHARACTER(f));
if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {
if (fstatfs(fd, &fsb)) {
fsb.f_iosize = SMALL_BUF_SIZE;
}
if (fsb.f_iosize != buf_size) {
if (buf != small_buf) {
free(buf);
}
if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
buf = small_buf;
buf_size = SMALL_BUF_SIZE;
} else {
buf_size = fsb.f_iosize;
}
}
while ((len = read(fd, buf, buf_size))) {
if (len == -1) {
(void)close(fd);
break;
}
for (p = buf; len--; ++p)
if (*p == '\\n')
++linect;
}
tlinect += linect;
(void)close(fd);
}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect;
UNPROTECT(2);
return(result);
";
setCMethod("wc",
signature(f="character"),
wc.code,
includes=c("#include <stdlib.h>",
"#include <stdio.h>",
"#include <sys/param.h>",
"#include <sys/mount.h>",
"#include <sys/stat.h>",
"#include <ctype.h>",
"#include <err.h>",
"#include <errno.h>",
"#include <fcntl.h>",
"#include <locale.h>",
"#include <stdint.h>",
"#include <string.h>",
"#include <unistd.h>",
"#include <wchar.h>",
"#include <wctype.h>",
"#define SMALL_BUF_SIZE (1024 * 8)"),
language="C",
convention=".Call")
wc("FULLPATHTOFILE")
作为一个包会更好,因为它实际上必须在第一次编译时通过。但是,如果您确实需要“速度”,请引用此处。对于
189,955
我躺在周围的行文件,我得到(来自一堆运行的平均值): user system elapsed
0.007 0.003 0.010
关于r - 使用 R 获取文本文件中的行数,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/23456170/