unicode - For Elasticsearch's asciifolding token filter, what is the mapping of Unicode characters to the first 127 ASCII characters?

Tags: unicode lucene elasticsearch

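Our product uses the asciifolding token filter, and our customers are asking for specific information about it. Specifically, they want the mapping of Unicode characters to their ASCII equivalents. I believe most conversions are obvious (for example, ü becomes u), but some are more subtle, such as "ß", which I believe is converted to "ss".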

I have searched on Google but could not find a definitive mapping. Is there somewhere I can get this information?

Thanks for your help,
Eric

Best answer

You can just read the source code for ASCIIFoldingFilter.

A sample from that source:

      case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
      case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
      case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
      case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
      case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
      case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
      case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
      case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
      case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
      case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
      case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
      case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
      case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
      case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
      case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
      case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
      case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
      case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
      case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
      case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
      case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
      case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
      case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
      case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
      case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
      case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
      case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
      case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
      case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
      case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
      case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
      case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
      case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
      case '\uFF21': // Ａ  [FULLWIDTH LATIN CAPITAL LETTER A]
        output[outputPos++] = 'A';
        break;

As you can see, it does nothing for Greek and Cyrillic letters, let alone other alphabets.

Also, you guessed right: ß is converted to ss:

      case '\u00DF': // ß  [LATIN SMALL LETTER SHARP S]
        output[outputPos++] = 's';
        output[outputPos++] = 's';
        break;
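
To make the behaviour described above concrete, here is a minimal, self-contained sketch of the same switch-based folding. It is not Lucene's code: the class name `FoldSketch` and the method `fold` are hypothetical, and only a handful of the cases quoted above are included. It assumes, as in the quoted excerpts, that each mapped character is replaced by one or more ASCII characters and that anything without a mapping is copied unchanged.

```java
// Simplified illustration only: covers a few of the cases quoted above,
// not the full table in Lucene's ASCIIFoldingFilter.
final class FoldSketch {

    // Folds a small, hand-picked subset of characters.
    // Anything without a case below is copied to the output unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                // Plain ASCII needs no folding.
                out.append(c);
                continue;
            }
            switch (c) {
                case 0x00C0: // LATIN CAPITAL LETTER A WITH GRAVE
                case 0x00C1: // LATIN CAPITAL LETTER A WITH ACUTE
                case 0x00C4: // LATIN CAPITAL LETTER A WITH DIAERESIS
                case 0x00C5: // LATIN CAPITAL LETTER A WITH RING ABOVE
                case 0xFF21: // FULLWIDTH LATIN CAPITAL LETTER A
                    out.append('A');
                    break;
                case 0x00DF: // LATIN SMALL LETTER SHARP S expands to two characters
                    out.append("ss");
                    break;
                default:
                    // No mapping in this sketch (for example Greek or Cyrillic).
                    out.append(c);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Inputs are written with Unicode escapes to avoid source-encoding issues.
        System.out.println(fold("\u00C4rger"));   // A WITH DIAERESIS + "rger"
        System.out.println(fold("Stra\u00DFe"));  // "Stra" + SHARP S + "e"
        System.out.println(fold("\u03B1\u03B2")); // Greek alpha, beta
    }
}
```

With this sketch, the first two inputs fold to "Arger" and "Strasse", while the Greek string comes back unchanged. For the authoritative list of mappings, the ASCIIFoldingFilter source remains the reference.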

This question was originally asked on Stack Overflow: https://stackoverflow.com/questions/25576417/
