java - Noisy string matching in Java?

Tags: java string pattern-matching substring

Consider the following strings:

Arg = "north_carolina_state_university"

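Text = "Hackney attended North Carolina State University before transferring to the University of North Carolina at Chapel Hill, where he earned bachelor's and Juris Doctor degrees. He worked as a prosecutor from 1971-74 before going into private practice. In 1974, he was campaign manager for Congressman Ike Andrews. While an undergraduate at UNC-Chapel Hill, he wrote his Honors Thesis on the history of the North Carolina corrections system."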

I know that some variant of Arg can be found in the text, but it will not necessarily be identical, and Arg itself may be noisy.

Here is another example:

Arg2 = "maurice_blackburn"

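Text2 = "Maurice McCrae Blackburn (19 November 1880 -- 31 March 1944), Australian politician and lawyer, was born in Inglewood, Victoria. He moved to Melbourne with his mother following the death of his father in 1887. He was educated at Melbourne Grammar School matriculating in 1896. After completing school, he attended the University of Melbourne, graduating in arts and law in 1909, and began to practice as a lawyer a year later."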

In the example above, the middle name that appears in Text2 ("McCrae") is not used in Arg2.

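Arg3 = "kansas_city_metropolitan_area"

Text3 = "Roach was elected as a Republican to the Sixty-seventh and Sixty-eighth Congresses (March 4, 1921-March 3, 1925). He served as chairman of the Committee on Expenditures in the Department of Justice (Sixty-eighth Congress). He was an unsuccessful candidate for reelection in 1924 to the Sixty-ninth Congress. He moved to St. Louis, Missouri, December 27, 1924, and resumed the practice of law. He died at Kansas City, Missouri, June 29, 1934. He was interred in Roach Cemetery near Roach, Missouri."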

In this example, "Kansas City" appears in Text3, but "metropolitan area" does not, even though it is part of Arg3.

Is there any function or library that can discover the occurrence of Arg in the text?

Best answer

Hopefully this answer at least gives you some ideas. I wrote a method to answer the question:

Any function/library to discover the occurrence of the Arg in the text?

Here is the output my method produces for the examples above:

Arg = "north_carolina_state_university"

Text = "Hackney attended North Carolina State University before transferring to the University of North Carolina at Chapel Hill, where he earned bachelor's and Juris Doctor degrees. He worked as a prosecutor from 1971-74 before going into private practice. In 1974, he was campaign manager for Congressman Ike Andrews. While an undergraduate at UNC-Chapel Hill, he wrote his Honors Thesis on the history of the North Carolina corrections system."

Output

Match Results

Words:4/4

Letters:28/28


Arg2 = "maurice_blackburn"

Text2 = "Maurice McCrae Blackburn (19 November 1880 -- 31 March 1944), Australian politician and lawyer, was born in Inglewood, Victoria. He moved to Melbourne with his mother following the death of his father in 1887. He was educated at Melbourne Grammar School matriculating in 1896. After completing school, he attended the University of Melbourne, graduating in arts and law in 1909, and began to practice as a lawyer a year later."

Output

Match Results

Words:2/2

Letters:16/16


Arg3 = "kansas_city_metropolitan_area"

Text3 = "Roach was elected as a Republican to the Sixty-seventh and Sixty-eighth Congresses (March 4, 1921-March 3, 1925). He served as chairman of the Committee on Expenditures in the Department of Justice (Sixty-eighth Congress). He was an unsuccessful candidate for reelection in 1924 to the Sixty-ninth Congress. He moved to St. Louis, Missouri, December 27, 1924, and resumed the practice of law. He died at Kansas City, Missouri, June 29, 1934. He was interred in Roach Cemetery near Roach, Missouri."

Output

Match Results

Words:2/4

Letters:13/26

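This method only looks at the English letters a-z and only compares whole words (any other character, such as a space or underscore, acts as a word separator). Letters are compared position by position from the start of each word, so it does not find letters that are shifted or out of order. If you search for cat and someone typed acat, it is reported as no word match and no letter matches either. The reasoning is that a dog is not a hotdog. You really have to decide how fuzzy you want the matching to be. This code is by no means the best, but I hope it gives you some ideas, and it could certainly be rewritten to be tidier. Either way, it does answer the exact question you asked.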

import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;

// Place both methods inside a class of your choice.

// Lowercases the text and splits it into words made only of the letters a-z;
// every other character (space, underscore, digit, punctuation) is a separator.
private static List<String> toWords(String text) {
    List<String> words = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    for (char c : text.toLowerCase().toCharArray()) {
        if (c >= 'a' && c <= 'z') {
            current.append(c);
        } else if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }
    // Add the last word only if there is one, so that text ending in
    // punctuation does not produce an empty word.
    if (current.length() > 0) {
        words.add(current.toString());
    }
    return words;
}

public static String search(String search, String target) {
    List<String> searchWords = toWords(search);
    List<String> targetWords = toWords(target);

    int sletters = 0; // letters in the search text
    for (String s : searchWords) {
        sletters += s.length();
    }
    int tletters = 0; // letters in the target text
    for (String t : targetWords) {
        tletters += t.length();
    }

    // blm[i] = best letter match for search word i: the highest number of
    // positions at which it agrees with any single target word.
    int[] blm = new int[searchWords.size()];
    for (int i = 0; i < searchWords.size(); i++) {
        String s = searchWords.get(i);
        for (String t : targetWords) {
            int lm = 0; // letter match against this target word
            int limit = Math.min(s.length(), t.length());
            for (int k = 0; k < limit; k++) {
                if (s.charAt(k) == t.charAt(k)) {
                    lm++;
                }
            }
            if (lm > blm[i]) {
                blm[i] = lm;
            }
        }
    }

    int mwords = 0;   // search words whose letters all matched
    int mletters = 0; // total matched letters
    for (int i = 0; i < blm.length; i++) {
        if (blm[i] == searchWords.get(i).length()) {
            mwords++;
        }
        mletters += blm[i];
    }

    return MessageFormat.format(
            "-----\nSearch text:\"{0}\"\nWords:{1}\nLetters:{2}\n-----\nTarget text:\"{3}\"\nWords:{4}\nLetters:{5}\n-----\nMatch Results\nWords:{6}/{1}\nLetters:{7}/{2}",
            searchWords.toString(), searchWords.size(), sletters,
            targetWords.toString(), targetWords.size(), tletters,
            mwords, mletters);
}
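If position-by-position comparison is too strict for your data (the cat versus acat case mentioned above), one option is to compare words by edit distance instead. The following is only a minimal sketch of that idea, not part of the method above: it uses the standard Levenshtein distance, and the names FuzzyWordMatch, editDistance, words, matchedWords and the maxEdits threshold are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class FuzzyWordMatch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lowercases the text and splits it on anything that is not a-z.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Counts the query words that have at least one text word within
    // maxEdits edits (maxEdits is an assumed, tunable threshold).
    static int matchedWords(String query, String text, int maxEdits) {
        List<String> textWords = words(text);
        int matched = 0;
        for (String q : words(query)) {
            for (String t : textWords) {
                if (editDistance(q, t) <= maxEdits) {
                    matched++;
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String text = "He died at Kansas City, Missouri, June 29, 1934.";
        System.out.println(
                matchedWords("kansas_city_metropolitan_area", text, 1));
    }
}
```

With maxEdits set to 0 this reduces to exact word matching; larger values tolerate more noise but also make short words match each other too easily, so the threshold probably needs tuning for your data (for example, scaling it with word length).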

Regarding "java - Noisy string matching in Java?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23095751/
