java - 用于字符串搜索的 KMP 算法?

标签 java arrays algorithm string-algorithm string-operations

我在网上发现了这个非常具有挑战性的编码问题,我想尝试一下。

一般的想法是给定字符串T和模式P,找到这个模式的出现,总结它的对应值并返回最大值和最小值。如果您想更详细地阅读问题,请参阅 this

然而,下面是我提供的代码,它适用于一个简单的测试用例,但是当运行在多个复杂的测试用例上时它非常慢,我不确定我的代码在哪里需要优化。

任何人都可以帮助我在逻辑错误的地方。

public class DeterminingDNAHealth {


  private DeterminingDNAHealth() {
    /*
     * Fixme:
     *  Each DNA contains number of genes
     *   - some of them are beneficial and increase DNA's total health
     *   - Each Gene has a health value
     *   ======
     *   - Total health of DNA = sum of all health values of beneficial genes
     */
  }

  int checking(int start, int end, String pattern) {
    String[] genesChar = new String[] {
      "a",
      "b",
      "c",
      "aa",
      "d",
      "b"
    };
    String numbers = "123456";

    int total = 0;

    for (int i = start; i <= end; i++) {
      total += KMPAlgorithm.initiateAlgorithm(pattern, genesChar[i]) * (i + 1);
    }

    return total;
  }

  public static void main(String[] args) {

    String[] genesChar = new String[] {
      "a",
      "b",
      "c",
      "aa",
      "d",
      "b"
    };
    Gene[] genes = new Gene[genesChar.length];

    for (int i = 0; i < 6; i++) {
      genes[i] = new Gene(genesChar[i], i + 1);
    }

    String[] checking = "15caaab 04xyz 24bcdybc".split(" ");


    DeterminingDNAHealth DNA = new DeterminingDNAHealth();
    int i, mostHealthiest, mostUnhealthiest;

    mostHealthiest = Integer.MIN_VALUE;
    mostUnhealthiest = Integer.MAX_VALUE;

    for (i = 0; i < checking.length; i++) {
      int start = Character.getNumericValue(checking[i].charAt(0));
      int end = Character.getNumericValue(checking[i].charAt(1));
      String pattern = checking[i].substring(2, checking[i].length());

      int check = DNA.checking(start, end, pattern);

      if (check > mostHealthiest)
        mostHealthiest = check;
      else
      if (check < mostUnhealthiest)
        mostUnhealthiest = check;
    }

    System.out.println(mostHealthiest + " " + mostUnhealthiest);

    // DNA.checking(1,5, "caaab");
  }
}

KMP算法

public class KMPAlgorithm {

  KMPAlgorithm() {}


  public static int initiateAlgorithm(String text, String pattern) {

    // let us generate our LPC table from the pattern
    int[] partialMatchTable = partialMatchTable(pattern);

    int matchedOccurrences = 0;

    // initially we don't have anything matched, so 0
    int partialMatchLength = 0;

    // we then start to loop through the text, !note, not the pattern. The text that we are testing the pattern on
    for (int i = 0; i < text.length(); i++) {

      // if there is a mismatch and there's no previous match, then we've hit the base-case, hence break from while{...}
      while (partialMatchLength > 0 && text.charAt(i) != pattern.charAt(partialMatchLength)) {

        /*
         * otherwise, based on the number of chars matched, we decrement it by 1.
         * In fact, this is the unique part of this algorithm. It is this part that we plan to skip partialMatchLength
         * iterations. So if our partialMatchLength was 5, then we are going to skip (5 - 1) iteration.
         */
        partialMatchLength = partialMatchTable[partialMatchLength - 1];

      }


      // if however we have a char that matches the current text[i]
      if (text.charAt(i) == pattern.charAt(partialMatchLength)) {

        // then increment position, so hence we check the next char of the pattern against the next char in text
        partialMatchLength++;

        // we will know that we're at the end of the pattern matching, if the matched length is same as the pattern length
        if (partialMatchLength == pattern.length()) {
          // to get the starting index of the matched pattern in text, apply this formula (i - (partialMatchLength - 1))

          // this line increments when a match string occurs multiple times;
          matchedOccurrences++;

          // just before when we have a full matched pattern, we want to test for multiple occurrences, so we make
          // our match length incomplete, and let it run longer.
          partialMatchLength = partialMatchTable[partialMatchLength - 1];

        }
      }

    }

    return matchedOccurrences;


  }


  private static int[] partialMatchTable(String pattern) {
    /*
     * TODO
     *  Note:
     *  => Proper prefix: All the characters in a string, with one or more cut off the end.
     *  => proper suffix: All the characters in a string, with one or more cut off the beginning.
     *
     *  1.) Take the pattern and construct a partial match table
     *
     *  To construct partial match table {
     *      1. Loop through the String(pattern)
     *      2. Create a table of size String(pattern).length
     *      3. For each character c[i], get The length of the longest proper prefix in the (sub)pattern
     *         that matches a proper suffix in the same (sub)pattern
     *  }
     */

    // we will need two incremental variables
    int i, j;

    // an LSP table also known as “longest suffix-prefix”
    int[] LSP = new int[pattern.length()];


    // our initial case is that the first element is set to 0
    LSP[0] = 0;

    // loop through the pattern...
    for (i = 1; i < pattern.length(); i++) {

      // set our j as previous elements data (not the index)
      j = LSP[i - 1];


      // we will be comparing previous and current elements data. ei char
      char current = pattern.charAt(i), previous = pattern.charAt(j);

      // we will have a case when we're somewhere in loop and two chars will not match, and j is not in base case.
      while (j > 0 && current != previous)
        // we decrement our j
        j = LSP[j - 1];

      // simply put, if two characters are same, then we update our LSP to say that at that point, we hold the j's value
      if (current == previous)
        // increment our j
        j++;

      // update the table
      LSP[i] = j;


    }

    return LSP;

  }
}

来源代码归功于 Github

最佳答案

您可以尝试这个 KMP 实现。它是 O(m+n),正如 KMP 的意图。它应该快得多:

private static int[] failureFunction(char[] pattern) {
    int m = pattern.length;

    int[] f = new int[pattern.length];
    f[0] = 0;

    int i = 1;
    int j = 0;

    while (i < m) {
        if (pattern[i] == pattern[j]) {
            f[i] = j + 1;
            i++;
            j++;
        } else if (j > 0) {
            j = f[j - 1];
        } else {
            f[i] = 0;
            i++;
        }
    }
    return f;
}

private static int kmpMatch(char[] text, char[] pattern) {
    int[] f = failureFunction(pattern);

    int m = pattern.length;
    int n = text.length;

    int i = 0;
    int j = 0;

    while (i < n) {
        if (pattern[j] == text[i]) {
            if (j == m - 1){
                return i - (m - 1);
            } else {
                i++;
                j++;
            }
        } else if (j > 0) {
            j = f[j - 1];
        } else {
            i++;
        }
    }
    return -1;
}

关于java - 用于字符串搜索的 KMP 算法?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/46821536/

相关文章:

java - Spring MVC + Oracle LOB + 流式处理

javascript - 操作数组和对象

algorithm - 'Divide and conquer' 是否为未排序的数组产生 O(log N)?

algorithm - 标准化分数以对事物进行排名的最佳方法是什么?

java - StAX 错误地读取事件读取器

Java util Logger 未写入文件

java - 字符串的哈希码是唯一的吗?

python - numpy 数组中的索引,其中切片在另一个数组中

JavaScript:将变量与数组数组中的元素匹配

对排名结果进行元排名的算法