r - How to select the longest ngram among repeated strings in R?

Tags: r string dataframe substring gsub

I have a dataset that looks like this (just with many more rows):

x = c("abov level", "abov level consist", "abov level consist price", 
"abov level consist price stabil", "abov level consist price stabil protract", 
"abov level consist price stabil protract period", "abov level consist price stabil protract period time", 
"abov level consist price stabil sinc", "abov level consist price stabil sinc last", 
"abov level consist price stabil sinc last autumn", "abov level consist price stabil some", 
"abov level consist price stabil some time", "abov over", "abov over come", 
"abov over come month", "abov precis", "abov precis level", "abov precis level depend", 
"abov precis level depend futur", "abov precis level depend futur energi", 
"abov precis level depend futur energi price", "abov precis level depend futur energi price develop"
)

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"
 [8] "abov level consist price stabil sinc"                
 [9] "abov level consist price stabil sinc last"           
[10] "abov level consist price stabil sinc last autumn"    
[11] "abov level consist price stabil some"                
[12] "abov level consist price stabil some time"           
[13] "abov over"                                           
[14] "abov over come"                                      
[15] "abov over come month"                                
[16] "abov precis"                                         
[17] "abov precis level"                                   
[18] "abov precis level depend"                            
[19] "abov precis level depend futur"                      
[20] "abov precis level depend futur energi"               
[21] "abov precis level depend futur energi price"         
[22] "abov precis level depend futur energi price develop"

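As you can see, there is a clear pattern: one word at a time is appended to the previous ngram, until the base changes and the process starts over. Taking the first "block" as an example: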

 [1] "abov level"                                          
 [2] "abov level consist"                                  
 [3] "abov level consist price"                            
 [4] "abov level consist price stabil"                     
 [5] "abov level consist price stabil protract"            
 [6] "abov level consist price stabil protract period"     
 [7] "abov level consist price stabil protract period time"

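For each "block" like the one above, I want to keep only the longest sentence/ngram. In the case above, I would keep only the seventh line. Doing this for every block, I would get: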

    
 [1] "abov level consist price stabil protract period time"           
 [2] "abov level consist price stabil sinc last autumn"    
 [3] "abov level consist price stabil some time"                                              
 [4] "abov over come month"                                      
 [5] "abov precis level depend futur energi price develop"

Can anyone help me do this?

Thanks!

Best answer

We can use filter and lead from dplyr:

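library(dplyr)

# Keep a row only when the next string is not longer than it (end of a block);
# the last row is compared with itself, so it is always kept.
tibble(x) %>%
  filter(nchar(lead(x, default = last(x))) - nchar(x) <= 0)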
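
The filter above compares string lengths only, so it relies on each new block starting with a string that is no longer than the last string of the previous block. If that might not hold in the full data, one possible alternative is to check whether the next string actually extends the current one. The following is a minimal base R sketch under that assumption; `keep_longest` is a hypothetical helper name, not part of dplyr or of the original answer, and the short vector `x` is just a sample cut from the question's data.

```r
# Hypothetical helper (not from the original answer): drop every string
# that is extended by the string immediately after it.
keep_longest <- function(x) {
  # Next element for each position; NA after the last one.
  nxt <- c(x[-1], NA_character_)
  # A string is "extended" when the next string starts with it plus a space.
  extended <- !is.na(nxt) & startsWith(nxt, paste0(x, " "))
  x[!extended]
}

# Small sample taken from the question's data.
x <- c("abov level", "abov level consist",
       "abov over", "abov over come",
       "abov precis")

result <- keep_longest(x)
print(result)
```

Like the dplyr version, this assumes the vector is ordered so that each ngram is directly followed by its extension, as in the question.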

Source: a similar question on Stack Overflow, "How to select the longest ngram among repeated strings in R?": https://stackoverflow.com/questions/65197900/
