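I have a column of string names, and I would like to find patterns (words) that occur often. Is there a way to return the strings whose length is greater than (or equal to) X and which occur more than Y times across the whole column?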
column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....)
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
> what times
[1] bla 5
[2] okay 5
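Since the question allows R or Python, the counting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `count_substrings` and the variable `frequent` are hypothetical, and every occurrence of a substring is counted. Note that plain counting also reports fragments such as "oka" and "kay", which is why nested patterns still need to be filtered out afterwards.

```python
from collections import Counter


def count_substrings(strings, min_len):
    """Count every occurrence of every substring of length >= min_len.

    Hypothetical helper for illustration; not part of the original post.
    """
    counts = Counter()
    for s in strings:
        for length in range(min_len, len(s) + 1):
            for start in range(len(s) - length + 1):
                counts[s[start:start + length]] += 1
    return counts


column = ["bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla"]
counts = count_substrings(column, min_len=3)

# Keep only the substrings that occur at least 4 times.
frequent = {p: n for p, n in counts.items() if n >= 4}
print(sorted(frequent.items()))
# [('bla', 5), ('kay', 5), ('oka', 5), ('okay', 5)]
```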
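Referring to Tim's comment:
I would like to remove the nested ones, so if both "aaaaaaa" and "aaaa" would appear in the output, only "aaaaaaa" and its number of occurrences should be reported.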
The output would be the same for atleaststringsize=3 and atleaststringsize=4. Suppose atleasttimes=10, "aaaaaaaa" occurs 15 times and "aaaaaa" occurs 15 times; then:
getOftenOccurringPatterns(column, atleaststringsize=3, atleasttimes=10)
> what times
[1] aaaaaaaa 15
and
getOftenOccurringPatterns(column, atleaststringsize=4, atleasttimes=10)
> what times
[1] aaaaaaaa 15
Only the longest pattern is kept, and the result is the same for atleaststringsize=3 and atleaststringsize=4.
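The nesting rule described above can be sketched in Python as a filter over already-counted patterns. This is one possible reading of the requirement, not a definitive one: the hypothetical helper `drop_nested` removes any pattern that is a literal substring of another retained pattern, regardless of how their counts compare.

```python
def drop_nested(counts):
    """Keep only patterns that are not contained in another pattern.

    `counts` maps pattern -> number of occurrences (hypothetical input shape).
    """
    patterns = list(counts)
    return {
        p: n
        for p, n in counts.items()
        # p survives only if no *other* pattern contains it literally
        if not any(p != q and p in q for q in patterns)
    }


frequent = {"aaaaaa": 15, "aaaaaaaa": 15}
print(drop_nested(frequent))  # {'aaaaaaaa': 15}
```

Because the filter looks only at the set of retained patterns, it gives the same answer for any minimum length that both patterns satisfy, which matches the behaviour asked for with atleaststringsize=3 and atleaststringsize=4.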
Best answer
This has not been tested at all and won't win any speed races:
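getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes, uniqueInColumns = FALSE){
  # Collect every substring of length >= atleaststringsize from each string
  res <- lapply(column, function(x){
    n <- nchar(x)
    if (n < atleaststringsize) return(NULL)  # too short; avoids a descending size:n sequence
    lapply(atleaststringsize:n, function(y){
      subs <- substring(x, 1:(n - y + 1), y:n)
      # Optionally count each substring at most once per string
      if (uniqueInColumns) unique(subs) else subs
    })
  })
  # Sort so identical substrings are adjacent, then count them with run-length encoding
  orderedRes <- unlist(res)[order(unlist(res))]
  encodedRes <- rle(orderedRes)
  # Keep the substrings that occur at least atleasttimes times
  partRes <- with(encodedRes, {check = (lengths >= atleasttimes);
    list(what = values[check], times = lengths[check])})
  # Flag patterns contained in another retained pattern (literal match, not regex)
  testRes <- sapply(partRes$what, function(x){length(grep(x, partRes$what, fixed = TRUE)) > 1})
  lapply(partRes, '[', !testRes)
}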
column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what
[1] "bla"  "okay"
$times
[1] 5 5
getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"
$times
[1] 6
getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumns = TRUE)
$what
[1] "aaaaa"
$times
[1] 4
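For readers who prefer Python, the same approach could be ported roughly as follows. This is a sketch rather than a tuned implementation: the function and parameter names are hypothetical, a `Counter` stands in for the sort-plus-`rle` counting, and the nesting check uses plain substring containment.

```python
from collections import Counter


def get_often_occurring_patterns(column, at_least_size, at_least_times,
                                 unique_in_columns=False):
    """Hypothetical Python port of the R function above."""
    counts = Counter()
    for s in column:
        # All substrings of s with length >= at_least_size
        subs = [s[i:i + n]
                for n in range(at_least_size, len(s) + 1)
                for i in range(len(s) - n + 1)]
        # Optionally count each substring at most once per string
        counts.update(set(subs) if unique_in_columns else subs)
    frequent = {p: c for p, c in counts.items() if c >= at_least_times}
    # Drop patterns that are contained in another frequent pattern
    return {p: c for p, c in frequent.items()
            if not any(p != q and p in q for q in frequent)}


names = ["aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"]
print(get_often_occurring_patterns(names, 3, 4))
# {'aaaaaa': 6}
print(get_often_occurring_patterns(names, 3, 4, unique_in_columns=True))
# {'aaaaa': 4}
```

With `unique_in_columns=True` a pattern is counted once per string that contains it, so "aaaaaa" (present in only three strings) falls below the threshold and "aaaaa" becomes the longest surviving pattern.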
Regarding python - getting often occurring string patterns from a column in R or Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16757306/