R - How to find the longest repeated sequences and their frequencies

Tags: r dataframe subsequence


29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43

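There is actually much more data, but I wanted to keep this short. I'd like a way in R to find the longest common subsequences across all rows, sorted by frequency (descending), reporting only those common subsequences that have more than one element and a frequency greater than one. Is there a way to do this in R?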


[29] 3
[30] 2 
( etc for all the single duplicates across each row and their frequencies )
[46  47] 2
[39  40  43] 3
[40, 43] 2


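You seem to be asking two different kinds of questions. You want 1) the lengths of column-wise contiguous runs of a single value and 2) counts (non-contiguous) of ngrams, built row-wise but counted column-wise.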

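library(dplyr)  # provides tibble(), filter() and %>%

# df is the data frame built with read.table() at the end of this answer
# single number contiguous runs by column
single <- Reduce("rbind", apply(df, 2, function(x) {
    r <- rle(x)  # run-length encode the column once
    tibble(val=r$values, occurrence=r$lengths) %>% filter(occurrence>1)
}))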


    val occurrence
  <int>      <int>
1    29          3
2    30          2
3    40          2
4    43          2
5    43          2
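
To see what rle() contributes to the snippet above, here is a small sketch on the first column of the sample data. The vector x is typed in by hand purely for illustration; it is not part of the original answer.

```r
# First column of the sample data, entered manually (illustrative assumption)
x <- c(29, 29, 29, 30, 30, 39, 7, 1)

# rle() collapses adjacent equal values into (value, run length) pairs
runs <- rle(x)
runs$values   # 29 30 39 7 1
runs$lengths  # 3 2 1 1 1

# Keep only values that repeat in adjacent rows, as the filter() step does
repeated <- runs$values[runs$lengths > 1]
```

Because rle() only looks at adjacent elements, a value that appears twice in a column with other values in between is counted as two separate runs of length one.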

# ngram numbers by row (count, non-contiguous)
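restof <- Reduce("rbind", lapply(1:(ncol(df)-1), function(z) {
    # build every run of z+1 adjacent values in each row, one column per start position
    nruns <- t(apply(df, 1, function(x) sapply(head(seq_along(x),-z), function(y) paste(x[y:(y+z)], collapse=" "))) )
    # count identical runs within each start position and keep those seen more than once
    Reduce("rbind", apply(nruns, 2, function(x) tibble(val=names(table(x)), occurrence=c(table(x))) %>% filter(occurrence>1)))
}))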

Output for ngrams

       val occurrence
     <chr>      <int>
1    39 40          2
2    46 47          2
3    40 43          3
4 39 40 43          2


ans <- rbind(single, restof)


       val occurrence
     <chr>      <int>
1       29          3
2       30          2
3       40          2
4       43          2
5       43          2
6    39 40          2
7    46 47          2
8    40 43          3
9 39 40 43          2
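
The counts above are tied to column position, so "39 40 43" is reported twice even though it appears in three rows of the sample data. If position should not matter and the result should be sorted by descending frequency as the question asks, a base-R variant along the following lines might be closer. This is only a sketch under that assumption, not the answer's method; row_ngrams and ngram_freq are hypothetical names introduced here.

```r
# Sample data from the question, rebuilt here so the block runs on its own
df <- read.table(text = "29 32 33 46 47 48
29 34 35 39 40 43
29 35 36 38 41 43
30 31 32 34 36 49
30 32 35 40 43 44
39 40 43 46 47 50
7 8 9 39 40 43
1 7 8 12 40 43")

# Hypothetical helper: all runs of n adjacent values in one row, as strings
row_ngrams <- function(x, n) {
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = " "), character(1))
}

# Collect ngrams of every length >= 2 from every row, ignoring column position
m <- as.matrix(df)
grams <- unlist(lapply(2:ncol(m), function(n) {
  unlist(lapply(seq_len(nrow(m)), function(r) row_ngrams(m[r, ], n)))
}))

# Keep ngrams seen more than once and sort by descending frequency
counts <- table(grams)
counts <- counts[counts > 1]
ngram_freq <- data.frame(val = names(counts),
                         occurrence = as.integer(counts),
                         stringsAsFactors = FALSE)
ngram_freq <- ngram_freq[order(-ngram_freq$occurrence, ngram_freq$val), ]
```

On the sample data this counts "39 40 43" three times and "40 43" five times, because shorter ngrams contained in a longer repeated one are counted as well. Removing those nested matches would need an extra step that is not shown here.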


df <- read.table(text="29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43")

Regarding "R - How to find the longest repeated sequences and their frequencies", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46224921/


