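I am trying to use gregexpr to find the positions of "ABCD" in a large string, and the positions of "ABBD", "ACCD", and "AAAD" in the same string. I want to output the "ABCD" results and the "ABBD, ACCD, AAAD" results in two separate columns of a data table.
My current approach is to run gregexpr separately for each pattern, export each result as a one-column txt file, import each file as a matrix, sort each one-column matrix so the numbers ascend by row, column-bind the two matrices, and convert the resulting two-column matrix into a data table.
This approach seems very inefficient on very large strings and takes quite a long time to finish. Is there any way to optimize the procedure? Thanks for your help!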
# dummy string that is relatively short for this demo
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"
# SEARCH for 'ABCD' location
out1 <- gregexpr(pattern = "ABCD", x)
cat(paste(c(out1[[1]]), sep = "\n", collapse = "\n"), file = "~/out_1.txt")
# SEARCH for 'A??D' location
outB <- gregexpr(pattern = "ABBD", x)
outC <- gregexpr(pattern = "ACCD", x)
outA <- gregexpr(pattern = "AAAD", x)
cat(paste(c(outA[[1]], outB[[1]], outC[[1]]), collapse = "\n"), file = "~/out_2.txt")
# Function that BINDS Matrices by column
cbind.fill <- function(...) {
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  # pad every matrix with NA rows up to the tallest one, then bind by column
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x) rbind(x, matrix(NA, n - nrow(x), ncol(x)))))
}
# Load as Tables --> Sort by numbers increasing --> Matrices
mat1 <- as.matrix(read.table("~/out_1.txt"))
mat2.t <- (read.table("~/out_2.txt"))
mat2 <- as.matrix(mat2.t[order(mat2.t$V1),])
# Combine two matrices to create 2 column matrix
comb_mat <- cbind.fill(mat1, mat2)
write.table(comb_mat, file = "~/comb_mat.txt", row.names = FALSE, col.names = FALSE)
Best answer
You can use the fixed=T argument of gregexpr(), which may yield a performance advantage. From https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html : If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
You can use sort() to sort the second column immediately, rather than storing an intermediate variable and then indexing it with order().
Your cbind.fill() function works, but the NA-padding can be done more simply with out-of-bounds indexing, since R naturally returns NA for out-of-bounds indexes.
Hence:
x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD';
out1 <- c(gregexpr('ABCD',x,fixed=T)[[1]]);
out2 <- sort(c(gregexpr('AAAD',x,fixed=T)[[1]],gregexpr('ABBD',x,fixed=T)[[1]],gregexpr('ACCD',x,fixed=T)[[1]]));
outmax <- max(length(out1),length(out2));
comb_mat <- cbind(out1[1:outmax],out2[1:outmax]);
comb_mat;
##      [,1] [,2]
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   NA   21
## [5,]   NA   25
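As a quick sanity check on the toy string, the following base R sketch confirms that literal matching finds the same positions as the default regex engine, and that out-of-bounds indexing pads with NA. It also tries a single alternation pattern in place of three searches plus sort(); that shortcut is an assumption that holds only because these three alternatives cannot overlap one another. The variable names are illustrative, not part of the original code.

```r
# Toy input from the question
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# c() drops the match.length attributes and keeps only the start positions
pos_regex <- c(gregexpr("ABCD", x)[[1]])
pos_fixed <- c(gregexpr("ABCD", x, fixed = TRUE)[[1]])
same_positions <- identical(pos_regex, pos_fixed)

# Three literal searches, merged and sorted
pos_sorted <- sort(c(gregexpr("AAAD", x, fixed = TRUE)[[1]],
                     gregexpr("ABBD", x, fixed = TRUE)[[1]],
                     gregexpr("ACCD", x, fixed = TRUE)[[1]]))

# One left-to-right scan with an alternation returns positions already in order
# (assumes the alternatives never overlap, which is true for these patterns)
pos_alt <- c(gregexpr("AAAD|ABBD|ACCD", x, perl = TRUE)[[1]])
same_order <- identical(pos_alt, pos_sorted)

# Indexing past the end of a vector yields NA, which is what pads the short column
padded <- pos_fixed[1:length(pos_sorted)]
```

Whether the alternation is faster than three fixed searches depends on the input, so it is worth timing both with system.time() on real data.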
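You can then write comb_mat to a file with your write.table() call.
Edit: As you (and now I) have discovered, gregexpr() performs surprisingly poorly on large strings, and your 237MB string is definitely a large string. From "Fast partial string matching in R", we can use the stringi package to speed things up. Below is a demonstration of how to use stringi::stri_locate_all() to accomplish what you want. Some notes:
The test input is a file named x; as you can see, I load it with data.table::fread(), because read.table() was taking too long.
The NA-padding is slightly changed: instead of indexing with 1:outmax, the code assigns the target length to both vectors.
Hence: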
library('data.table');
library('stringi');
x <- fread('x',header=F)$V1;
## Read 1 rows and 1 (of 1) columns from 0.221 GB file in 00:00:03
system.time({ out1 <- stri_locate_all(x,regex='ABCD')[[1]][,'start']; });
##    user  system elapsed
##   3.687   0.359   4.044
system.time({ out2 <- stri_locate_all(x,regex='AAAD|ABBD|ACCD')[[1]][,'start']; });
##    user  system elapsed
##   4.938   0.454   5.404
length(out1);
## [1] 22218750
length(out2);
## [1] 37031250
length(out1) <- length(out2) <- max(length(out1),length(out2));
comb_mat <- cbind(out1,out2);
head(comb_mat);
##      out1 out2
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   33   21
## [5,]   45   25
## [6,]   61   37
tail(comb_mat);
##             out1      out2
## [37031245,]   NA 236999961
## [37031246,]   NA 236999973
## [37031247,]   NA 236999977
## [37031248,]   NA 236999985
## [37031249,]   NA 236999989
## [37031250,]   NA 236999993
nrow(comb_mat);
## [1] 37031250
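The length-assignment padding used above can be tried on small vectors without loading the large file. This is a minimal base R sketch with the toy positions typed in by hand; the output path is a hypothetical temporary file, not the path from the question.

```r
# Start positions for the toy string, entered by hand for illustration
out1 <- c(1L, 13L, 29L)
out2 <- c(5L, 9L, 17L, 21L, 25L)

# Chained assignment evaluates right to left: out2 gets the target length
# first (no change here), then out1 is extended and the new slots become NA
length(out1) <- length(out2) <- max(length(out1), length(out2))

# Two-column matrix; column names are taken from the variable names
comb_mat <- cbind(out1, out2)

# Write the padded matrix the same way the question does
out_file <- tempfile(fileext = ".txt")  # hypothetical output location
write.table(comb_mat, file = out_file, row.names = FALSE, col.names = FALSE)
written <- readLines(out_file)
```

Extending a vector with `length<-` avoids building the 1:outmax index vector, which matters more as the vectors grow to tens of millions of elements.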
Source: the Stack Overflow question "R: faster gregexpr for very large strings" (tagged regex): https://stackoverflow.com/questions/31216299/