algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？

问题:

Consider the following data compression technique. We have a table of m text strings, each at most k in length. We want to encode a data string D of length n using as few text strings as possible. For example, if our table contains (a,ba,abab,b) and the data string is bababbaababa, the best way to encode it is (b,abab,ba,abab,a)—a total of five code words. Give an O(nmk) algorithm to find the length of the best encoding. You may assume that every string has at least one encoding in terms of the table.

我在skiena的算法设计手册中发现了这个问题并尝试解决。

我最好的猜测是将字符串(D)的所有可能的子字符串与长度(m)的表中的所有文本匹配，时间复杂度:O(n*n*k*m) .

有没有更好的方法在 O(nmk) 伪多项式时间内解决这个问题？

最佳答案

如果我误解了，请纠正我，但听起来你的解决方案的时间复杂度会比 O(nnkm) 大。

要生成包含 D 的所有元素的 D 的所有可能分区的列表，需要 O(nn)。然后，对于每个分区中的每个子集(其中最多有 n 个子集)，您需要验证表中是否存在匹配，每个子集需要 O(km)。在 n*n 个可能的分区之后(实际上不是 n*n，而是 n 的贝尔数)，匹配最多 n 个集合，每个集合将得到 O(nnnmk) 的复杂度

也就是说，我认为有一种更快的方法来尝试构建一棵树。

从 D 的开头开始，您可以尝试匹配表中的所有字符串。表中的每个字符串代表一个分支，如果表字符串在第一个位置与 D 不匹配，则分支终止。如果匹配，则从刚刚匹配的表字符串的末尾开始重复该过程。

此方法的一个技巧是您只需要保留产生唯一长度的一系列匹配项。以广度优先的方式重复这个过程意味着如果已经发现了一组达到特定长度的表字符串，那么任何再次达到该长度的表字符串都保证包含相同数量或更多的字符串。表但仍然匹配 D 的相同子字符串。例如，表字符串 {abab,ab} 都将产生长度为 4 的字符串 'abab'，但其中一个将在第一次迭代时匹配，一个将在第一次迭代时匹配第二。在这个例子中，包含 'ab'+'ab' 的编码总是比包含 'abab' 的编码差，所以我们丢弃了 'ab'+'ab' 编码。可以使用一个简单的 O(1) 查找表来检查这一点，并且以前见过的任何长度也可以终止其分支。因此，最多可以发现 n 个唯一长度，每个长度只能发现一次，这意味着最多有 n 个可能的分支。

仍然必须检查表中的字符串，这仍然是 O(mk)。添加分支，我相信这会产生 O(nmk) 的复杂度，因为最多创建 n 个分支。

例如，字符串 'aaaaaa' 上的表 {a, aa} 每次迭代将包含 2 个分支，共 3 次迭代。每个分支都需要 O(mk) 来检查，所以这个例子的时间复杂度是 O(nmk)。

编辑:由于对我提出的实现有些怀疑，我用 C 实现了它并做了一些分析。


#include <stdlib.h>
#include <string.h>
#include <stdio.h>

const char* alphabet = "abcdef";
const int alphabetLen = strlen(alphabet);

void printTable(char** table, int k){
    printf("Table: {%s", table[0]);
    for(int i=1; i<k; i++){
        printf(", %s", table[i]);
    }
    printf("}\n");
}

char** generateTable(int k, int m){

    char** table = (char**)malloc(sizeof(char*)*k);
    for(int i=0; i<k; i++){
        int tableStringLen = (rand() % (m-1)) + 1;
        table[i] = (char*)malloc(tableStringLen+1);
        for(int j=0; j<tableStringLen; j++){
            table[i][j] = alphabet[rand() % alphabetLen];
        }
        table[i][tableStringLen] = 0;
    }

    return table;
}

// Unfortunately, the length of the string is partially determined by the size of m
// it's done this way to avoid generating a random string and ensuring the table can encode it
char* generateString(char** table, int k, int m){
    int minLength = rand()%200 + m*2;
    char* string = (char*)malloc(minLength + m);

    int j = 0;
    for(int i=0; j<minLength; i++){
        int tableStringIndex = rand() % k;
        strcpy(string+j, table[tableStringIndex]);
        j += strlen(table[tableStringIndex]);
    }
    string[j] = 0;

    return string;
}

void printSolution(char* string, int* branchLengths, int n){
    if(!n){ return; }
    printSolution(string, branchLengths, branchLengths[n]);
    printf("%.*s ", n-branchLengths[n], string+branchLengths[n]);
}

int* solve(char* string, char** table, int n, int m, int k, int* comparisons){
    int* currentBranchEnds = (int*)calloc(sizeof(int), n); // keeps track of all of the current ends
    int currentBranchCount = 1; // keeps track of how many active branch ends there are

    int* newBranchEnds = (int*)calloc(sizeof(int), n); // keeps track of the new branches on each iteration
    int newBranchCount = 0;

    int* branchLengths = (int*)calloc(sizeof(int), (n+1)); // used to reconstruct the solution in the end

    // used for O(1) length lookups
    int* tableStringLengths = (int*)malloc(sizeof(int) * k);
    for(int i=0; i<k; i++){ tableStringLengths[i] = strlen(table[i]); }

    *comparisons = 0;

    // continue searching while the entire string hasn't been encoded
    while(!branchLengths[n]){

        // for every active branch
        for(int i=0; i<currentBranchCount; i++){


            // try all table strings
            for(int j=0; j<k; j++){
                int newBranchEnd = currentBranchEnds[i] + tableStringLengths[j];
                
                // if the new length (branch end) already exists OR
                // if the new branch would be too long for the string, discard the branch
                *comparisons += 1;
                if(newBranchEnd > n || branchLengths[newBranchEnd]){ continue; }

                // check to see if the table string matches the target string at this position
                // could be done with strcmp, but is done in a loop here to be explicit about complexity
                char match = 1;
                for(int l=0; table[j][l]; l++){
                    *comparisons += 1;
                    if(string[currentBranchEnds[i] + l] != table[j][l]){
                        match = 0;
                        break;
                    }
                }

                // if it matches, we can create a new branch at this position
                *comparisons += 1;
                if(match){
                    branchLengths[newBranchEnd] = currentBranchEnds[i];
                    newBranchEnds[newBranchCount] = newBranchEnd;
                    newBranchCount += 1;
                }
            }
        }

        // swap the branch ends arrays to save copying
        int* tmp = currentBranchEnds;
        currentBranchEnds = newBranchEnds;
        newBranchEnds = tmp;

        currentBranchCount = newBranchCount;
        newBranchCount = 0;
    }

    free(currentBranchEnds);
    free(newBranchEnds);
    free(tableStringLengths);

    return branchLengths;
}

int main(){
    int k = rand() % 30 + 2;
    int m = rand() % 15 + 2;

    char** table = generateTable(k, m);
    printTable(table, k);

    char* string = generateString(table, k, m);
    int n = strlen(string);
    printf("String: %s\n", string);

    int comparisons;
    int* solution = solve(string, table, n, m, k, &comparisons);
    printf("Solution: ");
    printSolution(string, solution, n);
    printf("\n");
    printf("Comparisons: %d\n", comparisons);

    for(int i=0; i<k; i++){ free(table[i]); }
    free(table);
    free(solution);
    free(string);
}

分析是通过对随机生成的 n、m 和 k 值运行 >500000 次算法生成的。如代码所示，“比较”的次数在每个 if 语句中都会增加，并且绘制了每个 n、m 和 k 值的平均比较次数。

比较次数(计算得出的)与 n、m 和 k 的大小之间显然存在线性关系。这表明复杂度是 O(nmk)

关于algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/69025441/

algorithm - 如何在伪多项式时间内从文本字符串表中找到字符串的最佳编码？

上一篇：flutter - 是否可以在 Flutter Web 和 Flutter App 中实现 Agora Video Call？

下一篇：python - 操作 numpy 数组(连接内部子数组)