algorithm - 从字符串中查找子串

输入:字符串 S = AAGATATGATAGGAT。

输出:最大重复，如 GATA(位置 3 和 8)、GAT(位置 3、8 和 13)等...

最大重复是子串 t 在 S 中出现 k>1 次，如果 t 向左或向右扩展，则出现少于 k 次。
内部节点的叶子后代是后缀，每个后缀都有一个左字符。
如果所有叶子后代的左字符都不完全相同，则称为“左多样化”节点。
最大重复次数是左分集内部节点。

总体思路:

构建后缀树，然后对树进行DFS(深度优先搜索)
对于每片叶子，用它左边的字符标记它
对于每个内部节点:
如果至少有一个 child 被标记为*，则标记为*
否则如果它的 child 的标签是不同的，用*标记。
否则所有 child 都有相同的标签，将其复制到当前节点

以上想法是否正确？伪代码如何得到？然后我可以尝试自己写程序。

最佳答案

你的想法很好，但是有了后缀树你可以做一些更容易的事情。

Let T be the suffix tree of your sequence .

Let x be a node in T, T_x is the subtree of T with root x.

Let N_x be the number of leaf in T_x

Let word(x) be the word created by traversing T from root to node x

Now using the definition of a suffix tree we get :

Number of repeats of word(x) = N_x and the position of this words are the label of each leaf

此算法将是一个基本的树遍历，对于树中的每个节点计算 N_x，如果 N_x > 2 将其添加到您的结果中(如果您也想要位置，您可以添加每个叶子的标签)

伪代码:

input :

mySequence

output:

Result (list of word that repeat with count and position)

Algorithm :

T = suffixTree(mySequence)

For each internal node X in T:
  T_X = subTree(T)
  N_X = Number of lead (T_X)
  if N_X >=2 :
  Result .add ( [word(X), N_X , list(label of leafs)] )
return Result

示例:

让我们以后缀树的维基百科为例:“BANANA”:

enter image description here

我们得到:

N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2  so "N" repeats 2 times  in position {2,4}
N_NA=2  so "NA" repeats 2 times in position {2,4}

我发现这篇论文似乎以与您相同的方式处理您的问题，所以是的，我认为您的方法是写:

Spelling approximate repeated or common motifs using a suffix tree

Extract

We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Sigma. For instance, Sigma may be equal to {A, C, G, T} and the sequence represent an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint).[...]

你可以下载看看，作者给出了你算法的伪代码。

希望对你有帮助

关于algorithm - 从字符串中查找子串，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/7763484/

algorithm - 从字符串中查找子串

上一篇：algorithm - 将较小的矩形组合成较大的矩形

下一篇：algorithm - 直线多边形相交