python - Get frequently occurring string patterns from a column in R or Python

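I have a column of string names, and I would like to find patterns (words) that occur often. Is there a way to return the strings that have a length of at least X and occur at least Y times across the whole column?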

column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....) 
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
>     what   times 
[1]   bla    5
[2]   okay   5

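Referring to Tim's comment:

I would like the nested ones removed: if both "aaaaaaa" and "aaaa" would appear in the output, only "aaaaaaa" and its number of occurrences should be reported.

With atleaststringsize=3 and with atleaststringsize=4, the output would be the same. Say atleasttimes=10, "aaaaaaaa" occurs 15 times and "aaaaaa" occurs 15 times, then: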

getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=10)
>    what      times
[1]  aaaaaaaa    15

and

getOftenOccuringPatterns(column, atleaststringsize=4, atleasttimes=10)
>    what      times
[1]  aaaaaaaa    15

Only the longest one stays, and the result is the same for atleaststringsize=3 and atleaststringsize=4.
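The nesting rule described above can be sketched on its own, separately from the counting. This is only an illustration under assumptions: the helper name `drop_nested` and the input dictionary (pattern mapped to its count) are hypothetical and not part of the question. A pattern is dropped whenever it is contained in another pattern that also made it into the result.

```python
def drop_nested(counts):
    """Keep only patterns that are not contained in another reported pattern.

    `counts` maps each candidate pattern to its number of occurrences
    (hypothetical input shape, chosen for this sketch).
    """
    patterns = list(counts)
    return {
        p: n
        for p, n in counts.items()
        # Drop p if some other pattern q contains it as a substring.
        if not any(p != q and p in q for q in patterns)
    }


# "aaaaaa" is nested inside "aaaaaaaa", so only the longer one survives.
counts = {"aaaaaaaa": 15, "aaaaaa": 15, "bla": 12}
print(drop_nested(counts))  # {'aaaaaaaa': 15, 'bla': 12}
```

Because the shorter pattern is removed no matter what minimum length was requested, a minimum of 3 and a minimum of 4 give the same result here.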

Best answer

This has not been tested in any way and will not win any speed races:

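getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes, uniqueInColumns = FALSE){

  # Collect every substring of length >= atleaststringsize from each string.
  res <- 
  lapply(column,function(x){
    # Strings shorter than atleaststringsize contribute nothing
    # (without this guard, atleaststringsize:nchar(x) would count downwards).
    if(nchar(x) < atleaststringsize) return(NULL)
    lapply(atleaststringsize:nchar(x),function(y){
      if(uniqueInColumns){
        # Count each substring at most once per string.
        unique(substring(x, 1:(nchar(x)-y+1), y:nchar(x)))
      }else{
        substring(x, 1:(nchar(x)-y+1), y:nchar(x))
      }
    })
  })

  # Sort so that identical substrings are adjacent, then count the runs.
  orderedRes <- unlist(res)[order(unlist(res))]
  encodedRes <- rle(orderedRes)

  # Keep only the substrings that occur at least atleasttimes times.
  partRes <- with(encodedRes, {check = (lengths >= atleasttimes);
                               list(what = values[check], times = lengths[check])})
  # Flag substrings nested in another kept substring; fixed = TRUE matches
  # the text literally instead of treating it as a regular expression.
  testRes <- vapply(partRes$what, function(x){length(grep(x, partRes$what, fixed = TRUE)) > 1}, logical(1))

  lapply(partRes, '[', !testRes)

}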


column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what
[1] "bla"  "okay"

$times
[1] 5 5


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"

$times
[1] 6


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumns = TRUE)
$what
[1] "aaaaa"

$times
[1] 4

Regarding "python - Get frequently occurring string patterns from a column in R or Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16757306/