How do I make KeywordAnalyzer recognize names like Müller regardless of the spelling?
KeywordAnalyzer requires an exact match. I would like it to match Müller, but also Mueller (the ue digraph) and Muller.
Best answer

The following custom analyzer solves the problem:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

public final class KeywordAnalyzerDE extends Analyzer {

    public KeywordAnalyzerDE() {
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        // Emit the whole input as a single token, as KeywordAnalyzer does.
        final Tokenizer source = new KeywordTokenizer();
        TokenStream result;
        // Unify German spelling variants (ä/ae, ö/oe, ü/ue, ß/ss).
        result = new GermanNormalizationFilter(source);
        // Fold any remaining diacritics to their ASCII equivalents.
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
The key is GermanNormalizationFilter. Its documentation says:
It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue.
- 'ß' is replaced by 'ss'
- 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
- 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
- 'ue' is replaced by 'u', when not following a vowel or q.
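To make these rules concrete, here is a minimal plain-Java sketch of how they could be applied to a single string. It is an illustration only, not Lucene's implementation: the class name GermanSpellingSketch, the State enum, and the normalize method are hypothetical, and only lowercase vowels and umlauts are handled. Non-ASCII characters are written as Unicode escapes so the file compiles under any source encoding.

```java
// Simplified illustration of the quoted rules; NOT Lucene's source code.
final class GermanSpellingSketch {

    // NORMAL: the previous character is neither a vowel nor 'q'.
    // AFTER_VOWEL: a following "ue" must be kept as written.
    // UMLAUT_CANDIDATE: a following 'e' is dropped (ae -> a, oe -> o, ue -> u).
    private enum State { NORMAL, AFTER_VOWEL, UMLAUT_CANDIDATE }

    private GermanSpellingSketch() {
    }

    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length() + 1);
        State state = State.NORMAL;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case 'a':
                case 'o':
                    out.append(c);
                    state = State.UMLAUT_CANDIDATE;
                    break;
                case 'u':
                    out.append(c);
                    // 'u' only starts a foldable "ue" when not following a vowel or 'q'.
                    state = (state == State.NORMAL) ? State.UMLAUT_CANDIDATE : State.AFTER_VOWEL;
                    break;
                case 'e':
                    if (state != State.UMLAUT_CANDIDATE) {
                        out.append(c);
                    }
                    state = State.AFTER_VOWEL;
                    break;
                case 'i':
                case 'q':
                case 'y':
                    out.append(c);
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00e4': // a with umlaut
                    out.append('a');
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00f6': // o with umlaut
                    out.append('o');
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00fc': // u with umlaut
                    out.append('u');
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00df': // sharp s
                    out.append("ss");
                    state = State.NORMAL;
                    break;
                default:
                    out.append(c);
                    state = State.NORMAL;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // All three spellings print "Muller".
        for (String name : new String[] {"M\u00fcller", "Mueller", "Muller"}) {
            System.out.println(normalize(name));
        }
    }
}
```

Because all three spellings collapse to the same string, an exact-match comparison on the normalized form succeeds, while a word such as Mauer keeps its "ue" because the u follows a vowel.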
I added ASCIIFoldingFilter in case the processed text contains other diacritics.
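As a rough illustration of what folding "other diacritics" means, the sketch below strips combining marks using the JDK's java.text.Normalizer. This is a stand-in technique, not ASCIIFoldingFilter itself: the class name DiacriticFoldingSketch and the stripDiacritics method are hypothetical, and characters without a canonical decomposition (ø, for example) pass through unchanged here.

```java
// Stand-in for diacritic folding using only the JDK; NOT ASCIIFoldingFilter.
final class DiacriticFoldingSketch {

    private DiacriticFoldingSketch() {
    }

    static String stripDiacritics(String input) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = java.text.Normalizer.normalize(input, java.text.Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("Ren\u00e9"));          // e with acute accent
        System.out.println(stripDiacritics("Dvo\u0159\u00e1k"));   // r with caron, a with acute
    }
}
```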
Looking at the source code really helped.
Source: "java - KeywordAnalyzer to handle different spellings of words with diacritics" on Stack Overflow: https://stackoverflow.com/questions/60579871/