r - dplyr 中的快速字符串计数

我需要在小标题的列中保存的字符串中计算模式(固定)的实例。 mutate(x = str_count(col, pattern))完全符合我的要求，但速度不够快，无法处理我必须评估的字符串量。

简化版

test = tibble(
  id=c(1,2,3),
  seq=c("ATCG","ATTT","CGCG")
)

有效，但在微基准测试中速度太慢

test %>% mutate(CpG = str_count(seq, "CG"))

理论上在单个序列上更快，但不适用于我的 tibble 列

这只是给出了第一行计数的单个值

test %>% mutate(CpG = sum(gregexpr("CG", seq, fixed=TRUE)[[1]] > 0))

我试图让 purrr::map 工作，但我失败了......这是我无法运行的尝试:

test %>% mutate(CpG = map(~sum(gregexpr("CG", seq, fixed=TRUE)[[1]] > 0)))
test %>% mutate(CpG = map(seq, ~sum(gregexpr("CG", .x, fixed=TRUE)[[1]] > 0)))

编辑:时间基准

看起来像 stringi::stri_count_fixed是测试集上的最优选择。

microbenchmark(
  mutate(test, CpG = stringr::str_count(seq, "CG")),
  mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x, fixed=TRUE)[[1]] > 0))),
  mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))),
  mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))),
  mutate(test, CpG = stringi::stri_count_fixed(seq,"CG"))
)  
Unit: microseconds
                                                                                          expr     min       lq     mean   median       uq      max neval
                                             mutate(test, CpG = stringr::str_count(seq, "CG")) 125.502 133.9995 159.0844 147.9045 164.2925  441.242   100
 mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x,      fixed = TRUE)[[1]] > 0))) 167.210 183.4695 215.7746 202.8890 230.5275  450.177   100
                        mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))) 789.889 827.3810 955.8919 887.3400 989.7415 1845.212   100
                             mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))) 150.315 159.4750 178.9039 169.1840 185.1345  316.479   100
                                      mutate(test, CpG = stringi::stri_count_fixed(seq, "CG")) 112.015 120.6615 131.4281 125.0470 135.4330  213.184   100

而且，这是我的真实数据的一部分(1500 nt 序列的约 20,000 个实例)在远程服务器上运行的结果，该服务器的可用 RAM 和内核是我初始运行的约 3 倍。我继续对 stringi 印象深刻的实现，它以某种方式在惊人的 20 毫秒内通读了所有内容。

Unit: milliseconds
                                                                                          expr        min         lq       mean     median        uq        max neval
                                             mutate(test, CpG = stringr::str_count(seq, "CG"))  402.55513  408.04101  419.82213  414.31085  425.4113  481.08128   100
 mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x,      fixed = TRUE)[[1]] > 0)))  429.84148  436.02402  474.08864  438.84198  447.3058 1054.48253   100
                        mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq)))  301.26062  309.76674  310.97071  310.08229  310.8837  336.12834   100
                             mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))) 1981.78309 1990.84206 2050.41928 1999.07389 2020.3675 2980.62566   100
                                      mutate(test, CpG = stringi::stri_count_fixed(seq, "CG"))   23.33313   23.55215   23.90881   23.66268   23.8916   27.40499   100

因为我可以在本地 DNAStringSet 上运行它，因为我的数据从一个 GRanges 对象开始，我也对序列向量进行了微基准测试，并获得了 240-250 毫秒的范围。

我还有另一个用例，它提出了 1,000-100,000 nt 长的序列，具有更长的搜索模式，并计划在我返回该项目时使用这些结果进行更新。

最佳答案

您可以使用 Biostrings::vcountPattern

library(Biostrings)
test %>% mutate(CpG = vcountPattern("CG", DNAStringSet(seq)))
#  # A tibble: 3 x 3
#     id seq     CpG
#  <dbl> <chr> <int>
#1     1 ATCG      1
#2     2 ATTT      0
#3     3 CGCG      2

对于较长(DNA)字符串的向量，Biostrings::vcountPattern 比 str_count 解决方案快得多；这是一些 microbenchmark 结果

# Sample data
df <- tibble(
    id = 1:100,
    seq = replicate(
        100,
        paste0(sample(c("A", "C", "G", "T"), 100000, replace = T), collapse = "")))


library(microbenchmark)
res <- microbenchmark(
    str_count = df %>% mutate(CpG = str_count(seq, "CG")),
    vmatchPattern = df %>% mutate(CpG = vcountPattern("CG", DNAStringSet(seq))))
#Unit: milliseconds
#          expr       min       lq     mean   median       uq      max neval
#     str_count 156.37642 162.1629 167.2005 164.4958 169.2456 251.3426   100
# vmatchPattern  95.68998 102.4176 108.2428 105.2501 108.5511 322.3143   100

library(ggplot2)
autoplot(res)

关于r - dplyr 中的快速字符串计数，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/59781647/

r - dplyr 中的快速字符串计数

上一篇：ruby-on-rails - rails Controller 的速率限制

下一篇：Django - CreateView - 如何声明变量并在模板中使用它