我正在尝试实现 Evgeny Kluev 在他对 loop rolling algorithm 的回答中给出的算法 但我无法让它工作。下面是我按照他的指示尝试手工计算的示例:
text: ababacababd
<STEP 1>
suffixes and LCPs:
ababacababd
4
ababd
3
abacababd
2
abd
1
acababd
0
babacababd
3
babd
2
bacababd
1
bd
0
cababd
0
d
<STEP 2>
sorted LCP array indices: 0,1,5,2,6,3,7,4,8,9
(values) : 4,3,3,2,2,1,1,1,0,0
<STEP 3>
LCP groups sorted by position in text (format => position: entry):
lcp 4:
0: ababacababd
6: ababd
lcp 3:
1: babacababd
2: abacababd
6: ababd
7: babd
lcp 2:
2: abacababd
3: bacababd
7: babd
8: abd
lcp 1:
3: bacababd
4: acababd
8: abd
9: bd
lcp 0:
0: ababacababd
1: babacababd
4: acababd
5: cababd
9: bd
10: d
<STEP 4>
entries remaining after filter (LCP == positional difference):
none! only (abd),(bd) and (bacababd),(acababd) from LCP 1 group
have positional difference equal to 1 but they don't share prefixes
with each other. shouldn't i have at least (ab) and (ba) here?
谁能告诉我在这个过程中我做错了什么?
此外,他在第 4 步的末尾说我们应该在文本中包含所有可能的序列,他的意思是所有可能的重复序列吗?
这是一个已知的算法吗?我可以在其他地方找到更多详细信息的名称?
(我也对他在第 5 步中对相交序列的定义感到困惑,但如果我正确理解前面的步骤,也许它会有意义)。
编辑:在 Evgeny 的帮助说明之后,这是我对步骤 4、5、6 的了解:
<STEP 4>
filter pseudocode:
results = {}
for (positions,lcp) in lcp_groups:
results[lcp] = []
while positions not empty:
pos = positions.pop(0) #pop lowest element
if (pos+lcp) in positions:
common = prefix(input, pos, lcp)
if common.size() < lcp:
next
i = 1
while (pos+lcp*(i+1)) in positions:
if common != prefix(input, pos+lcp*i, lcp):
break
positions.delete(pos+lcp*i)
i += 1
results[lcp].append( (common, pos, i+1) )
application of filter logic:
lcp 4:
0: ababacababd # 4 not in {6}
6: ababd # 10 not in {}
lcp 3:
0: ababacababd # 3 not in {1,2,6,7}
1: babacababd # 4 not in {2,6,7}
2: abacababd # 5 not in {6,7}
6: ababd # 9 not in {7}
7: babd # 10 not in {}
lcp 2:
0: ababacababd # 2 in {1,2,3,6,7,8}, 4 not in {1,2,3,6,7,8} => ("ab", 0, 2)
1: babacababd # 3 in {2,3,6,7,8}, 5 not in {2,3,6,7,8} => ("ba", 1, 2)
2: abacababd # 4 not in {3,6,7,8}
3: bacababd # 5 not in {6,7,8}
6: ababd # 8 in {7,8}, 10 not in {7,8} => ("ab", 6, 2)
7: babd # 9 not in {8}
8: abd # 10 not in {}
lcp 1:
0: ababacababd # 1 in {1,2,3,4,6,7,8,9}, prefix is ""
1: babacababd # 2 in {2,3,4,6,7,8,9}, prefix is ""
2: abacababd # 3 in {3,4,6,7,8,9}, prefix is ""
3: bacababd # 4 in {4,6,7,8,9}, prefix is ""
4: acababd # 5 not in {6,7,8,9}
6: ababd # 7 in {7,8,9}, prefix is ""
7: babd # 8 in {8,9}, prefix is ""
8: abd # 9 in {9}, prefix is ""
9: bd # 10 not in {}
sequences: [("ab", 0, 2), ("ba", 1, 2), ("ab", 6, 2)]
<STEP 5>
add sequences in order of LCP grouping. sequences within an LCP group
are added according to which was generated first:
lcp 4: no sequences
lcp 3: no sequences
lcp 2: add ("ab", 0, 2)
lcp 2: dont add ("ba", 1, 2) because it intersects with ("ab", 0, 2)
lcp 2: add ("ab", 6, 2)
lcp 1: no sequences
collection = [("ab", 0, 2), ("ab", 6, 2)]
(order by position not by which one was added first)
<STEP 6>
recreate input by iterating through the collection in order and
filling in gaps with the normal input:
input = "ab"*2 + input[4..5] + "ab"*2 + input[10..10]
Evgeny,如果您碰巧再次看到这个问题,请问您一个简短的问题: 我是否正确执行了第 5 步?也就是说,我是否根据它们从哪个 LCP 组生成来添加序列(首先是具有较高 LCP 值的组)?还是与LCP有关?
此外,如果第 4 步或第 6 步有任何问题,请告诉我,但看起来我所用的方法非常适合这个示例。
最佳答案
我必须澄清原始答案中“按 LCP 值分组”的含义。事实上,对于具有选定 LCP 值的组,我们应该包括具有较大 LCP 值的所有条目。
这意味着对于您的示例,在处理 LCP3 时,我们需要将前面的条目 0 和 6 合并到该组中。在处理 LCP2 时,我们需要将所有前面的条目与 LCP3 和 LCP4 合并:0、1、2、6、7。
结果,过滤后剩下两对 (ab) 和一对 (ba)。但由于 (ba) 与第一对 (ab) “相交”,它在第 5 步被拒绝。
Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?
没错,我的意思是所有可能的重复序列。
Is this a known algorithm with a name that I can find more details on elsewhere?
我不知道。以前从未见过这样的算法。
以下是步骤 2 .. 4 的实现方式(伪代码):
for (in_sa, in_src) in suffix_array: # step 2
lcp_groups[max(LCP[in_sa.left], LCP[in_sa.right])].append((in_sa, in_src))
apply(sort_by_position_in_src, lcp_groups) # step 3
for lcp from largest downto 1: # step 4
# sorted by position in src array and unique:
positions = merge_and_uniq(positions, lcp_groups[lcp])
for start in positions:
pos = start
while (next = positions[pos.in_src + lcp]).exists
and LCP.RMQ(pos.in_sa, next.in_sa) >= lcp
and not(prev = positions[pos.in_src - lcp]).exists # to process each
and LCP.RMQ(pos.in_sa, prev.in_sa) >= lcp): # chain only once
pos = next
if pos != start:
pass_to_step5(start, lcp, pos + lcp)
这里我没有为positions
规划任何特定的数据结构。但为方便起见,假定一个有序的关联数组。 RMQ是Range Minimum Query,所以LCP数组应该做相应的预处理。
此代码与 OP 中的代码几乎相同。但它没有使用昂贵的字符串比较(如 common != prefix(input, pos+lcp*i, lcp)
),而是使用 RMQ,它(如果正确实现)几乎可以立即工作(并且具有相同的效果)作为字符串比较,因为当子字符串与前面的子字符串共有的起始字符太少时,它允许拒绝子字符串。
它具有二次最坏情况时间复杂度。对于像“aaaaaaaaa”这样的输入数组应该很慢。而且要找到“更好”字符串的时间复杂度并不容易,在“平均”情况下它可能是次二次的。同样的问题可以用更简单的二次时间算法来解决:
def loop_rolling(begin, end):
distance = (end - begin) / 2)
for d from distance downto 1:
start = pos = begin
while pos + d < end:
while (pos + d < end) and (src[pos] == src[pos + d]):
++pos
repeats = floor((pos - start) / d)
if repeats > 0:
pass_to_step5(start, d, start + d * (repeats + 1))
start = pos
或者通过删除步骤 5 和 6 可以使它变得更简单。但是下面的变体有一个缺点。它太贪心了,所以它会找到 2*(2*(ab))ab 而不是 5*(ab):
def loop_rolling(begin, end, distance):
distance = min(distance, (end - begin) / 2))
for d from distance downto 1:
start = pos = begin
while pos + d < end:
while (pos + d < end) and (src[pos] == src[pos + d]):
++pos
repeats = floor((pos - start) / d)
if repeats > 0:
loop_rolling(begin, start, d - 1)
print repeats+1, "*("
loop_rolling(start, start + d, d - 1) # "nested loops"
print ')'
loop_rolling(start + d * (repeats + 1), end, d)
return
else:
if d == 1: print src[start .. pos]
start = pos
关于string - "loop rolling algorithm"的解释,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/19359259/