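Consider the following strings:
Arg = "north_carolina_state_university"
Text = "Hackney attended North Carolina State University before transferring to the University of North Carolina at Chapel Hill, where he earned bachelor's and Juris Doctor degrees. He worked as a prosecutor from 1971-74 before going into private practice. In 1974, he was campaign manager for Congressman Ike Andrews. While an undergraduate at UNC-Chapel Hill, he wrote his Honors Thesis on the history of the North Carolina corrections system."
I know that some variant of Arg can be found in the text, but it is not necessarily identical, and Arg may contain noise.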
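Another example:
Arg2 = "maurice_blackburn"
Text2 = "Maurice McCrae Blackburn (19 November 1880 -- 31 March 1944), Australian politician and lawyer, was born in Inglewood, Victoria. He moved to Melbourne with his mother following the death of his father in 1887. He was educated at Melbourne Grammar School matriculating in 1896. After completing school, he attended the University of Melbourne, graduating in arts and law in 1909, and began to practice as a lawyer a year later."
In the example above, the middle name that appears in Text2 (McCrae) is not part of Arg2.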
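Arg3 = "kansas_city_metropolitan_area"
Text3 = "Roach was elected as a Republican to the Sixty-seventh and Sixty-eighth Congresses (March 4, 1921-March 3, 1925). He served as chairman of the Committee on Expenditures in the Department of Justice (Sixty-eighth Congress). He was an unsuccessful candidate for reelection in 1924 to the Sixty-ninth Congress. He moved to St. Louis, Missouri, December 27, 1924, and resumed the practice of law. He died at Kansas City, Missouri, June 29, 1934. He was interred in Roach Cemetery near Roach, Missouri."
In this example, "Kansas City" appears in Text3, but without "metropolitan area" (which does appear in Arg3).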
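Is there any function or library that can discover occurrences of Arg in the text?
Best answer
I hope this answer at least gives you some ideas. I wrote a method to answer the question: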
Any function/library to discover the occurrence of the Arg in the text?
Here is the output my method produced for the examples above:
Arg = "north_carolina_state_university"
Text = "Hackney attended North Carolina State University before transferring to the University of North Carolina at Chapel Hill, where he earned bachelor's and Juris Doctor degrees. He worked as a prosecutor from 1971-74 before going into private practice. In 1974, he was campaign manager for Congressman Ike Andrews. While an undergraduate at UNC-Chapel Hill, he wrote his Honors Thesis on the history of the North Carolina corrections system."
Output
Match Results
Words:4/4
Letters:28/28
Arg2 = "maurice_blackburn"
Text2 = "Maurice McCrae Blackburn (19 November 1880 -- 31 March 1944), Australian politician and lawyer, was born in Inglewood, Victoria. He moved to Melbourne with his mother following the death of his father in 1887. He was educated at Melbourne Grammar School matriculating in 1896. After completing school, he attended the University of Melbourne, graduating in arts and law in 1909, and began to practice as a lawyer a year later."
Output
Match Results
Words:2/2
Letters:16/16
Arg3 = "kansas_city_metropolitan_area"
Text3 = "Roach was elected as a Republican to the Sixty-seventh and Sixty-eighth Congresses (March 4, 1921-March 3, 1925). He served as chairman of the Committee on Expenditures in the Department of Justice (Sixty-eighth Congress). He was an unsuccessful candidate for reelection in 1924 to the Sixty-ninth Congress. He moved to St. Louis, Missouri, December 27, 1924, and resumed the practice of law. He died at Kansas City, Missouri, June 29, 1934. He was interred in Roach Cemetery near Roach, Missouri."
Output
Match Results
Words:2/4
Letters:13/26
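This method only looks at the English alphabet and only at whole words (runs of letters; every other character is treated as a space), and it does not look for a word's letters out of order. If you search for cat and the text contains acat, it is reported as no match, with no matching letters either. That is because a dog is not a hotdog. You really have to decide how fuzzy you want the match to be. This code is by no means the best, but I hope it gives you some ideas; it could probably be rewritten to be tidier and better organized. Either way, it does answer the exact question you asked.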
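The cat/acat case can be reproduced in isolation. The following is a minimal sketch of a position-by-position letter comparison of the kind the method relies on; the class name `PositionalMatchDemo` and the helper `positionalMatches` are made up for this illustration and are not part of the answer's code.

```java
public class PositionalMatchDemo {

    // Counts the positions, up to the shorter word's length, at which
    // both words have the same letter.
    static int positionalMatches(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < limit; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Every letter of "cat" is shifted by one position in "acat".
        System.out.println(positionalMatches("cat", "acat")); // prints 0
        // "cat" is a prefix of "cats", so all three letters line up.
        System.out.println(positionalMatches("cat", "cats")); // prints 3
    }
}
```

Because letters are only compared at the same index, a single extra leading character is enough to drop the score to zero, while a search word that is a prefix of a text word scores as a full match. The answer's full method follows.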
import java.text.MessageFormat;
import java.util.ArrayList;

// Place this method inside any class.
public static String search(String search, String target) {
    search = search.toLowerCase();
    target = target.toLowerCase();
    StringBuilder temp = new StringBuilder();
    ArrayList<String> searchWords = new ArrayList<String>();
    ArrayList<String> targetWords = new ArrayList<String>();
    // Prefixes: s = search, t = target, m = matched.
    int swords, twords, sletters = 0, tletters = 0, mwords = 0, mletters = 0;

    // Split the search string into words; anything outside a-z is a separator.
    for (char c : search.toCharArray()) {
        if (c >= 'a' && c <= 'z') {
            temp.append(c);
            sletters++;
        } else if (temp.length() > 0) {
            searchWords.add(temp.toString());
            temp.setLength(0);
        }
    }
    // Add the last word, skipping the empty one left by a trailing separator.
    if (temp.length() > 0)
        searchWords.add(temp.toString());
    temp.setLength(0);

    // Split the target text the same way.
    for (char c : target.toCharArray()) {
        if (c >= 'a' && c <= 'z') {
            temp.append(c);
            tletters++;
        } else if (temp.length() > 0) {
            targetWords.add(temp.toString());
            temp.setLength(0);
        }
    }
    if (temp.length() > 0)
        targetWords.add(temp.toString());
    temp.setLength(0);

    search = searchWords.toString();
    target = targetWords.toString();
    swords = searchWords.size();
    twords = targetWords.size();

    int[] blm = new int[searchWords.size()]; // best letter match per search word
    for (int i = 0; i < searchWords.size(); i++) {
        String s = searchWords.get(i);
        for (String t : targetWords) {
            int lm = 0; // letters that match at the same position in both words
            int limit = Math.min(s.length(), t.length());
            for (int i2 = 0; i2 < limit; i2++) {
                if (t.charAt(i2) == s.charAt(i2))
                    lm++;
            }
            if (blm[i] < lm)
                blm[i] = lm;
        }
    }

    // A search word counts as matched when all of its letters matched,
    // i.e. it equals, or is a prefix of, some target word.
    for (int i = 0; i < blm.length; i++) {
        if (blm[i] == searchWords.get(i).length())
            mwords++;
        mletters += blm[i];
    }

    return MessageFormat.format(
            "-----\nSearch text:\"{0}\"\nWords:{1}\nLetters:{2}\n-----\nTarget text:\"{3}\"\nWords:{4}\nLetters:{5}\n-----\nMatch Results\nWords:{6}/{1}\nLetters:{7}/{2}",
            search, swords, sletters, target, twords, tletters,
            mwords, mletters);
}
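If the noise in Arg includes misspellings, one way to make the match fuzzier is to allow a small edit distance per word instead of comparing letters position by position. The sketch below swaps in Levenshtein distance for that purpose; it is not the method above, and the names `FuzzyWordMatch`, `editDistance`, `words`, `matchedWords` and the `maxEdits` threshold are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class FuzzyWordMatch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lowercases the input and splits it on anything that is not a-z.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        for (String w : s.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Counts the words of arg that have at least one word in text
    // within maxEdits edits.
    static int matchedWords(String arg, String text, int maxEdits) {
        List<String> textWords = words(text);
        int matched = 0;
        for (String a : words(arg)) {
            for (String t : textWords) {
                if (editDistance(a, t) <= maxEdits) {
                    matched++;
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String text = "Maurice McCrae Blackburn, Australian politician and lawyer";
        // The misspelled "maurise" is one substitution away from "maurice".
        System.out.println(matchedWords("maurise_blackburn", text, 1)); // prints 2
        System.out.println(matchedWords("kansas_city_metropolitan_area",
                "He died at Kansas City, Missouri", 1)); // prints 2
    }
}
```

The threshold is the knob for how fuzzy the match should be. A fixed `maxEdits` is probably too generous for very short words, so scaling it with word length may work better in practice.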
Regarding java - Noisy string matching in Java?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23095751/