r - Improve performance of computing the sum of word scores over a large vector of strings?

Tags: r string performance loops vectorization

I have a vector of strings that looks like this:

 [1] "What can we learn from the Mahabharata "                                                                
 [2] "What are the most iconic songs associated with the Vietnam War "                                        
 [3] "What are some major social faux pas to avoid when visiting Malta "                                      
 [4] "Will Ready Boost technology contribute to CFD software usage "                                          
 [5] "Who is Jon Snow " ...

and a data frame that assigns a score to each word:
   word score
   the    11
    to     9
  What     9
     I     7
     a     6
   are     6

I want to assign to each of my strings the sum of the scores of the words it contains. My solution is the following function:
 score_fun <- function(x) {
   # obtain the vector of words in the string
   z <- unlist(strsplit(x, ' '))
   # return the sum of the matching words' scores
   sum(word_scores$score[word_scores$word %in% z])
 }

 # apply the function over the vector of strings
 scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

 # the output looks like
 scores
 [1] 20 26 24  9  0  0 38 32 30  0

The problem I have is performance: with roughly 500,000 strings and over a million words, running this function takes more than an hour on my i7, 16 GB machine.
Besides, the solution just feels inelegant and clunky.

Is there a better (more efficient) solution?

Data to reproduce the problem:
 my_strings <- c("What can we learn from the Mahabharata ",
     "What are the most iconic songs associated with the Vietnam War ",
     "What are some major social faux pas to avoid when visiting Malta ",
     "Will Ready Boost technology contribute to CFD software usage ",
     "Who is Jon Snow ", "Do weighing scales measure mass or weight ",
     "What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ",
     "Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ",
     "What are some really good positive comments to say on a cricket field to your teammates ",
     "Is Donald Trump fact free ")


 word_scores <- data.frame(
     word  = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
     score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
     stringsAsFactors = FALSE
 )

Best Answer

You can use tidytext::unnest_tokens to tokenize into words, then join and aggregate:

library(tidyverse)
library(tidytext)

data_frame(string = my_strings, id = seq_along(string)) %>%    # data_frame() is tibble() in current tidyverse
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>% # one row per (id, word) pair
    distinct() %>%                                             # count each word once per string, as %in% does
    left_join(word_scores) %>%                                 # attach scores; unmatched words get NA
    group_by(id) %>%
    summarise(score = sum(score, na.rm = TRUE))                # sum per string, treating NA as 0

#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0

You can keep the original strings if you like, or rejoin them by ID at the end, as in the sketch below.
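
For example, a minimal sketch of that rejoin step, assuming the summarised tibble above has been stored in a variable called scored (a name introduced here only for illustration):

 # rebuild the id-to-string lookup and join it back onto the scores
 scored %>%
     left_join(data_frame(id = seq_along(my_strings), string = my_strings), by = 'id')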

On data this small it is much slower, but it becomes faster at scale, e.g. when my_strings is resampled to a length of 10,000 (a sketch of the benchmark setup follows the table):

Unit: milliseconds
     expr        min         lq      mean    median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
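
For reference, a sketch of how such a timing can be set up with microbenchmark. The Reduce() variant from the table is not shown in the answer, so only the two approaches above are compared, and the resampling line is an assumption about how the 10,000-string vector was built:

 library(microbenchmark)

 # assumed setup: resample the example strings up to length 10,000
 big_strings <- sample(my_strings, 1e4, replace = TRUE)

 microbenchmark(
     sapply = sapply(big_strings, score_fun, USE.NAMES = FALSE),
     tidytext = data_frame(string = big_strings, id = seq_along(string)) %>%
         unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
         distinct() %>%
         left_join(word_scores, by = 'word') %>%
         group_by(id) %>%
         summarise(score = sum(score, na.rm = TRUE)),
     times = 100
 )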

Original question on Stack Overflow: https://stackoverflow.com/questions/43565864/
