regex - Faster gregexpr for very large strings in R

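I am trying to use gregexpr to find the positions of "ABCD" in a large string, and also the positions of "ABBD", "ACCD", and "AAAD" in the same string. I want to output the "ABCD" results and the "ABBD, ACCD, AAAD" results as two separate columns of a data table.

My current approach is to run gregexpr separately for each search, export each result as a one-column txt file, import each file as a matrix, sort each one-column matrix so the numbers ascend by row, column-bind the two matrices, and convert the resulting two-column matrix into a data table.

This approach seems very inefficient on very large strings and takes quite a long time to finish. Is there any way to optimize the procedure? Thanks for your help!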

# dummy string that is relatively short for this demo
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# SEARCH for 'ABCD' location
out1 <- gregexpr(pattern = "ABCD", x)
cat(paste(c(out1[[1]]), sep = "\n", collapse = "\n"), file = "~/out_1.txt")    

# SEARCH for 'A??D' location
outB <- gregexpr(pattern = "ABBD", x)
outC <- gregexpr(pattern = "ACCD", x)
outA <- gregexpr(pattern = "AAAD", x)
cat(paste(c(outA[[1]], outB[[1]], outC[[1]]), collapse = "\n"), file = "~/out_2.txt")

# Function that BINDS Matrices by column
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function (x) rbind(x, matrix(, n-nrow(x), ncol(x)))))
}

# Load as Tables --> Sort by numbers increasing --> Matrices
mat1 <- as.matrix(read.table("~/out_1.txt"))
mat2.t <- (read.table("~/out_2.txt"))
mat2 <- as.matrix(mat2.t[order(mat2.t$V1),])

# Combine two matrices to create 2 column matrix 
comb_mat <- cbind.fill(mat1, mat2)
write.table(comb_mat, file = "~/comb_mat.txt", row.names = FALSE, col.names = FALSE)

Best answer

There is no need for intermediate files.

I would use the fixed=T argument of gregexpr(), which may give a performance benefit. From https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html :

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
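As a rough illustration of the options mentioned in that passage, here is a minimal sketch that locates the same literal pattern with the default engine, with PCRE, and with fixed matching. The variable names (pos_default, pos_perl, pos_fixed) are made up for this example, and the short string stands in for the real one. No timings are shown, since they depend on the data and the machine.

```r
# Short stand-in for the real (much larger) string.
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# as.integer() drops the match.length attributes and keeps only the positions.
pos_default <- as.integer(gregexpr("ABCD", x)[[1]])                # default regex engine
pos_perl    <- as.integer(gregexpr("ABCD", x, perl = TRUE)[[1]])   # PCRE
pos_fixed   <- as.integer(gregexpr("ABCD", x, fixed = TRUE)[[1]])  # literal match

# For a literal pattern like this one, all three return the same positions.
stopifnot(identical(pos_default, pos_perl), identical(pos_perl, pos_fixed))

# To compare speed on your own data, time each variant, for example:
# system.time(gregexpr("ABCD", x, fixed = TRUE))
```

Only the matching engine differs between the three calls, so switching to fixed = TRUE should be safe whenever the pattern contains no regex metacharacters.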

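You can sort the second column right away with sort(), rather than storing an intermediate variable and then indexing it with order().

Your cbind.fill() function works, but the NA padding can be done more simply with out-of-bounds indexing, since R naturally returns NA for out-of-bounds indexes.

Thus: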

x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD';
out1 <- c(gregexpr('ABCD',x,fixed=T)[[1]]);
out2 <- sort(c(gregexpr('AAAD',x,fixed=T)[[1]],gregexpr('ABBD',x,fixed=T)[[1]],gregexpr('ACCD',x,fixed=T)[[1]]));
outmax <- max(length(out1),length(out2));
comb_mat <- cbind(out1[1:outmax],out2[1:outmax]);
comb_mat;
##      [,1] [,2]
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   NA   21
## [5,]   NA   25

You can then write comb_mat to a file with your write.table() call.
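One caveat with combining several gregexpr() results: when a pattern has no match at all, gregexpr() returns -1, and that value would end up in the sorted column. Here is a minimal sketch of one way to guard against it. The helper name locate_fixed and the pattern "AZZD" are made up for illustration and are not part of the code above.

```r
# Hypothetical helper: start positions of a literal pattern, with the
# -1 "no match" marker that gregexpr() returns dropped.
locate_fixed <- function(pattern, text) {
  pos <- as.integer(gregexpr(pattern, text, fixed = TRUE)[[1]])
  pos[pos > 0L]
}

x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"
out1 <- locate_fixed("ABCD", x)

# "AZZD" does not occur in x, so it contributes nothing to the result.
patterns <- c("AAAD", "ABBD", "ACCD", "AZZD")
out2 <- sort(unlist(lapply(patterns, locate_fixed, text = x)))

# Pad the shorter column with NA via out-of-bounds indexing, as above.
outmax <- max(length(out1), length(out2))
comb_mat <- cbind(out1[seq_len(outmax)], out2[seq_len(outmax)])
```

If every pattern is guaranteed to occur at least once, the filter is unnecessary and the shorter code above is enough.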

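Edit: As you (and now I) have discovered, gregexpr() performs surprisingly poorly on large strings, and your 237MB string is definitely a large string. Following Fast partial string matching in R, we can use the stringi package to speed things up. Below is a demonstration of how to use stringi::stri_locate_all() to do what you need. Some notes:

For my own testing I created my own 237MB file; its size is actually exactly 237,000,001 bytes. I basically used vim to repeat your 32-byte example string 7,406,250 times, for a total of 237,000,000 bytes, with the extra byte coming from the LF that vim appends. I named my test file x, and you can see that I load it with data.table::fread(), since read.table() took too long.

I made a small change to my NA-padding algorithm. I realized that, instead of using out-of-bounds indexing, we can assign the maximum length to the length of both vectors, taking advantage of the right-to-left associativity of the assignment operator. The benefit is that we no longer have to construct the index vector 1:outmax.

Thus: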

library('data.table');
library('stringi');
x <- fread('x',header=F)$V1;
## Read 1 rows and 1 (of 1) columns from 0.221 GB file in 00:00:03
system.time({ out1 <- stri_locate_all(x,regex='ABCD')[[1]][,'start']; });
##    user  system elapsed
##   3.687   0.359   4.044
system.time({ out2 <- stri_locate_all(x,regex='AAAD|ABBD|ACCD')[[1]][,'start']; });
##    user  system elapsed
##   4.938   0.454   5.404
length(out1);
## [1] 22218750
length(out2);
## [1] 37031250
length(out1) <- length(out2) <- max(length(out1),length(out2));
comb_mat <- cbind(out1,out2);
head(comb_mat);
##      out1 out2
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   33   21
## [5,]   45   25
## [6,]   61   37
tail(comb_mat);
##             out1      out2
## [37031245,]   NA 236999961
## [37031246,]   NA 236999973
## [37031247,]   NA 236999977
## [37031248,]   NA 236999985
## [37031249,]   NA 236999989
## [37031250,]   NA 236999993
nrow(comb_mat);
## [1] 37031250
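To see the length-assignment padding in isolation, here is a minimal sketch on two short vectors. The names a, b, and padded are placeholders standing in for out1, out2, and comb_mat.

```r
# Short stand-ins for the two position vectors.
a <- c(1L, 13L, 29L)
b <- c(5L, 9L, 17L, 21L, 25L)

# Assignment associates right to left: max() is computed first, then
# length(b) is set, and the same value is assigned to length(a).
# Growing a vector this way pads it with NA.
length(a) <- length(b) <- max(length(a), length(b))

padded <- cbind(a, b)
```

After this, a is 1, 13, 29, NA, NA, b is unchanged, and padded is a 5-row matrix with columns named a and b.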

Regarding "regex - Faster gregexpr for very large strings in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31216299/
