c++ - Adapting a Boyer-Moore implementation

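I am trying to adapt the Boyer-Moore C(++) implementation from Wikipedia so that it returns all matches of a pattern in a string. As it stands, the Wikipedia implementation returns only the first match. The main code looks like this: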

char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);

    i = patlen-1;
    while (i < stringlen) {
        int j = patlen-1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            return (string + i+1);
        }

        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}

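I tried modifying the block after if (j < 0) to add the index to an array/vector and let the outer loop continue, but it does not seem to work. When I test the modified code I still get only one match. Perhaps this implementation was not designed to return all matches and needs more than a quick change to do so? I don't understand the algorithm itself very well, so I'm not sure how to make this work. If anyone can point me in the right direction, I would be grateful.

Note: the functions make_delta1 and make_delta2 are defined earlier in the source (see the Wikipedia page), and the max() call is actually a macro that is also defined earlier in the source.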

Best answer

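Boyer-Moore's algorithm exploits the fact that when searching for, say, "HELLO WORLD" in a longer string, the letter you find at a given position restricts what can be found around that position if a match is to exist at all. It is a bit like the game of Battleship: if you find open sea four cells from the border, you needn't test the four remaining cells in case a 5-cell carrier is hiding there; it can't be.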

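For example, if you find a 'D' in the eleventh position, it might be the last letter of HELLO WORLD; but if you find a 'Q', and 'Q' is nowhere in HELLO WORLD, then the searched string cannot be anywhere in the first eleven characters, and you can avoid searching there altogether. An 'L', on the other hand, might mean that HELLO WORLD is there, starting at position 11-3 (the third letter of HELLO WORLD is an L), 11-4, or 11-10.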

While searching, you keep track of these possibilities using the two delta arrays.
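To make the 'Q' versus 'L' reasoning concrete, here is a minimal sketch of a bad-character table, the kind of information the first delta array holds. The name build_bad_char_table and its exact signature are made up for illustration; it is not the Wikipedia make_delta1, only the same idea.

```c
#include <stddef.h>

#define ALPHABET_LEN 256

/* Hypothetical helper, modeled on the idea behind make_delta1:
   table[c] is how far the end of the search window may advance when the
   text character under inspection is c. */
void build_bad_char_table(size_t table[ALPHABET_LEN],
                          const unsigned char *pat, size_t patlen)
{
    size_t k;

    /* Characters absent from the pattern allow a jump of the full length. */
    for (k = 0; k < ALPHABET_LEN; k++)
        table[k] = patlen;

    /* For every character except the last one, record the distance from its
       rightmost occurrence to the end of the pattern. */
    for (k = 0; k + 1 < patlen; k++)
        table[pat[k]] = patlen - 1 - k;
}
```

For HELLO WORLD this gives 11 for 'Q' (skip the whole window) but only 1 for 'L', because the last L sits one position before the end of the pattern.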

So when you find a match, you should do something like this:

if (j < 0)
{
    // Found a match occupying positions i+1 .. i+patlen
    // Store it in a vector or whatever is needed; check we don't overflow it.
    if (index_counter + 1 >= index_size)
    {
        // No room left: the last slot is reserved for the terminator
        index[index_counter] = 0;
        free(delta2);
        return index_size;
    }
    index[index_counter++] = i+1;

    // Reinitialize j to restart search
    j = patlen-1;

    // i is one before the start of the match here, so this puts the end
    // of the search window one position past the end of the match
    i += patlen + 1;

    // Do not free delta2
    // free(delta2);

    // Continue loop without altering i again
    continue;
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;

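This should return a zero-terminated list of indexes, provided you pass something like a size_t *index, together with its capacity index_size, to the function. Since a match at offset 0 looks just like the terminator, rely on the returned count rather than on the terminator alone.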

The function will then return 0 (not found), index_size (too many matches), or the number of matches, between 1 and index_size-1.

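This makes it possible, for example, to fetch additional matches without repeating the whole search for the (index_size-1) substrings already found: you increase num_indexes by new_num, realloc the indexes array, then pass the function the new array at offset old_index_size-1, new_num as the new size, and the haystack string starting one past the offset of the last match found (not, as I wrote in an earlier revision, that offset plus the length of the needle string; see the comments). The offsets found by the second call are then relative to that new starting point.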
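The resume-and-grow idea can be sketched roughly as follows. Everything here is an assumption made for illustration: search_some and search_in_batches are hypothetical names, the inner search uses strstr as a stand-in for the Boyer-Moore function, and the array is tracked by count rather than by a zero terminator.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real search (strstr is used purely for brevity):
   stores up to cap match offsets, relative to text, and returns how
   many were stored. */
static size_t search_some(const char *text, const char *pat,
                          size_t *out, size_t cap)
{
    size_t n = 0;
    const char *p = text;

    while (n < cap && (p = strstr(p, pat)) != NULL) {
        out[n++] = (size_t)(p - text);
        p++;                          /* step one char so overlaps are kept */
    }
    return n;
}

/* Collects all matches in batches, growing the array between calls.
   Returns the match count and sets *result (the caller frees it).
   On allocation failure returns 0 and leaves *result NULL. */
size_t search_in_batches(const char *text, const char *pat,
                         size_t batch, size_t **result)
{
    size_t *idx = NULL;
    size_t total = 0;
    size_t base = 0;                  /* where the next call starts in text */

    *result = NULL;
    if (*pat == '\0' || batch == 0)
        return 0;

    for (;;) {
        size_t got, k;
        size_t *grown = realloc(idx, (total + batch) * sizeof *grown);

        if (grown == NULL) {
            free(idx);
            return 0;
        }
        idx = grown;

        got = search_some(text + base, pat, idx + total, batch);
        for (k = 0; k < got; k++)
            idx[total + k] += base;   /* turn relative offsets into absolute */
        total += got;

        if (got < batch)
            break;                    /* fewer than asked for: text exhausted */
        base = idx[total - 1] + 1;    /* last match plus one, not plus patlen */
    }
    *result = idx;
    return total;
}
```

Restarting at the last match plus one, rather than plus the pattern length, is what keeps overlapping matches from being skipped between batches.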

Restarting one position past each match also reports overlapping matches, e.g. searching for ana in banana will find b*ana*na and ban*ana*.
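For reference, here is a self-contained sketch of a search that collects every match, overlaps included. To stay short it is not the Wikipedia code: it uses only a bad-character table (no delta2) and falls back to a one-step shift whenever that table would not move the window forward. The name find_all and its parameters are made up for this example.

```c
#include <stddef.h>

#define ALPHABET_LEN 256

/* Hypothetical helper: stores the offset of every match of pat in text,
   overlapping ones included, and returns how many were stored (at most
   out_cap). Only a bad-character table is used for skipping. */
size_t find_all(const unsigned char *text, size_t textlen,
                const unsigned char *pat, size_t patlen,
                size_t *out, size_t out_cap)
{
    size_t delta1[ALPHABET_LEN];
    size_t count = 0;
    size_t end;      /* index in text of the last character of the window */
    size_t k;

    if (patlen == 0 || patlen > textlen)
        return 0;

    for (k = 0; k < ALPHABET_LEN; k++)
        delta1[k] = patlen;
    for (k = 0; k + 1 < patlen; k++)
        delta1[pat[k]] = patlen - 1 - k;

    end = patlen - 1;
    while (end < textlen && count < out_cap) {
        size_t matched = 0;  /* characters matched, counting from the right */

        while (matched < patlen &&
               text[end - matched] == pat[patlen - 1 - matched])
            matched++;

        if (matched == patlen) {
            out[count++] = end + 1 - patlen;  /* start of the match */
            end += 1;                         /* slide by one: keeps overlaps */
        } else {
            /* Align the mismatching text character with its rightmost
               occurrence in the pattern, but always move forward. */
            size_t candidate = (end - matched) + delta1[text[end - matched]];
            end = (candidate > end) ? candidate : end + 1;
        }
    }
    return count;
}
```

Called on "banana" with the pattern "ana", this stores offsets 1 and 3 and returns 2.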

Update

I tested the above and it appears to work. I modified the Wikipedia code by adding these two includes to keep gcc from complaining:

#include <stdio.h>
#include <string.h>

Then I modified the if (j < 0) block to simply print what it found:

    if (j < 0) {
            printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
            //free(delta2);
            // return (string + i+1);
            i += patlen + 1;
            j = patlen - 1;
            continue;
    }

Finally, I tested it with this:

int main(void)
{
    char *s = "This is a string in which I am going to look for a string I will string along";
    char *p = "string";
    boyer_moore(s, strlen(s), p, strlen(p));
    return 0;
}

and, as expected, got:

Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along

If the string contains two overlapping sequences, both are found:

char *s = "This is an andean andeandean andean trouble";
char *p = "andean";

Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble

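To avoid overlapping matches, the quickest way is simply not to store the overlaps. It could be done inside the search itself, but that would mean reinitializing the first delta vector and updating the string pointer; we would also need to keep a second i index, i2, to prevent the saved indexes from becoming non-monotonic. It isn't worth it. Better: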

    if (j < 0) {
        // We have found a patlen-long match at i+1
        // Is it an overlap? It is if it starts before the last stored match ends.
        if (index && (indexes[index] + patlen > i+1))
        {
            // Yes, it is. So we don't store it.


            // We could store the last of several overlaps
            // It's not exactly trivial, though:
            // searching 'anana' in 'Bananananana'
            // finds FOUR matches, and the fourth is NOT overlapped
            // with the first. So in case of overlap, if we want to keep
            // the LAST of the bunch, we must save info somewhere else,
            // say last_conflicting_overlap, and check twice.
            // Then again, the third match (which is the last to overlap
            // with the first) would overlap with the fourth.

            // So the "return as many non overlapping matches as possible"
            // is actually accomplished by doing NOTHING in this branch of the IF.
        }
        else
        {
            // Not an overlap, so store it.
            indexes[++index] = i+1;
            if (index == max_indexes) // Too many matches already found?
                break; // Stop searching and return found so far
        }
        // Adapt i and j to keep searching
        i += patlen + 1;
        j = patlen - 1;
        continue;
    }
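The same overlap test can also be applied after the search, as a post-filter over the collected offsets. A minimal sketch, assuming the offsets are already sorted in ascending order (which a left-to-right scan gives you); drop_overlaps is a made-up name.

```c
#include <stddef.h>

/* Hypothetical helper: offsets must be sorted in ascending order, as a
   left-to-right scan produces them. Keeps the leftmost match of each
   overlapping run, compacts the array in place, returns the new count. */
size_t drop_overlaps(size_t *offsets, size_t n, size_t patlen)
{
    size_t kept = 0;
    size_t k;

    for (k = 0; k < n; k++) {
        /* Overlap: this match starts before the last kept one ends. */
        if (kept > 0 && offsets[kept - 1] + patlen > offsets[k])
            continue;
        offsets[kept++] = offsets[k];
    }
    return kept;
}
```

With the four matches of anana in Bananananana (offsets 1, 3, 5, 7) this keeps 1 and 7, which is the "do nothing on overlap" behaviour described in the comments above.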

Regarding c++ - Adapting a Boyer-Moore implementation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12702741/
