java - 修改编辑距离算法以不计算所有距离

标签 java algorithm performance levenshtein-distance

我正在研究模糊搜索实现,作为实现的一部分,我们使用 Apache 的 StringUtils.getLevenshteinDistance。目前,我们正在为我们的模糊搜索寻求特定的最大平均响应时间。经过各种增强和一些分析后,花费最多时间的地方是计算 Levenshtein 距离。它大约占搜索字符串三个或更多字母的总时间的 80-90%。

现在,我知道这里可以做的事情有一些限制,但我已经阅读了以前的 SO 问题和 LD 的维基百科链接,如果有人愿意将阈值限制为设定的最大距离,那可以帮助减少花在算法上的时间,但我不确定如何准确地做到这一点。

If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]

下面您将看到来自 StringUtils 的原始 LH 代码。之后就是我的修改了。我试图从根本上计算设定长度与 i,j 对角线的距离(因此,在我的示例中,是 i,j 对角线上方和下方的两条对角线)。但是,这不可能是正确的,因为我已经做到了。例如,在最高的对角线上,它总是会选择正上方的单元格值,该值将为 0。如果有人能告诉我如何按照我所描述的那样实现这个功能,或者一些关于如何实现它的一般建议, 这将不胜感激。

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

我的修改(仅针对 for 循环):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

最佳答案

实现窗口的问题是处理每行中第一个条目左侧和最后一个条目上方的值。

一种方法是将您最初填写的值从 1 而不是 0 开始,然后忽略您遇到的任何 0。您必须从最终答案中减去 1。

另一种方法是用高值填充 first 和 above last 左边的条目,这样 minimum check 永远不会选择它们。这是我前几天不得不实现时选择的方式:

public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold 
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

        if(p[tlen] == Integer.MAX_VALUE)
            return -1;
    return p[tlen];
}

关于java - 修改编辑距离算法以不计算所有距离,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/3866249/

相关文章:

java - Rest Web 服务中的双向 SSL/TLS 身份验证

java - libGDX 在静态类中处理纹理

algorithm - 计算递归关系 T(n)=T(n/log n) + Θ(1)

algorithm - 将元素放入多个容量袋中的最佳解决方案

c++ - g++ 对 -Ofast 做了哪些额外的优化?

java - Split 方法是将字符串拆分为单个字符串

java - 在 jetty 和 tomcat 上运行 spring hibernate web 应用程序

ruby - 在 Ruby 中实现数组的 to_s

java - 在 Java 与 C++ 中读取二进制文件

c++ - 多线程算法工作得更慢