JavaCC and Unicode issue: why is \u696d, which falls in the "\u4e00"-"\u9fff" range, not handled by JavaCC?

Tags: java unicode compiler-construction antlr javacc

We are trying to use JavaCC as a parser for source code encoded in UTF-8 (the language is Japanese). In JavaCC we have this declaration:

< #LETTER:
  [
   "\u0024",
   "\u0041"-"\u005a",
   "\u005f",
   "\u0061"-"\u007a",
   "\u00c0"-"\u00d6",
   "\u00d8"-"\u00f6",
   "\u00f8"-"\u00ff",
   "\u0100"-"\u1fff",
   "\u3040"-"\u318f",
   "\u3300"-"\u337f",
   "\u3400"-"\u3d2d",
   "\u4e00"-"\u9fff",
   "\uf900"-"\ufaff"
  ]
>

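When the parser encounters a string such as "日建フェンス工業", it fails because of the character 業. If I remove that character, it works as expected. The code for 業 is "\u696d", which, as the declaration shows, should fall within the "\u4e00"-"\u9fff" range.

Any suggestions?

PS: What would this grammar look like if we rewrote it in ANTLR?

Many thanks.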

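Best answer

There is nothing wrong with your token fragment, and nothing wrong with JavaCC. The problem lies elsewhere.

Here is a JavaCC specification I made by copying and pasting the code from your question into JavaCC.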

options {
  static = true;
  debug_token_manager = true;
}

PARSER_BEGIN(MyNewGrammar)
package funnyunicode;
import java.io.StringReader ;

public class MyNewGrammar
{
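  public static void main(String args []) throws ParseException
  {
    // With static = true, constructing the parser initializes its static state,
    // so go() can then be called on the class.
    MyNewGrammar parser = new MyNewGrammar(new StringReader("日建フェンス工業"));
    MyNewGrammar.go();
    System.out.println("OK.");
  }
}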
PARSER_END(MyNewGrammar)

TOKEN :
{
  < WORD : (<LETTER>)+ >
|
  < #LETTER:
  [
   "\u0024",
   "\u0041"-"\u005a",
   "\u005f",
   "\u0061"-"\u007a",
   "\u00c0"-"\u00d6",
   "\u00d8"-"\u00f6",
   "\u00f8"-"\u00ff",
   "\u0100"-"\u1fff",
   "\u3040"-"\u318f",
   "\u3300"-"\u337f",
   "\u3400"-"\u3d2d",
   "\u4e00"-"\u9fff",
   "\uf900"-"\ufaff"
  ] >
}

void go() :
{Token tk ; }
{
  tk=<WORD> <EOF>
}

Here is the output of the generated Java program:

Current character : \u65e5 (26085) at line 1 column 1
   Starting NFA to match one of : { <WORD> }
Current character : \u65e5 (26085) at line 1 column 1
   Currently matched the first 1 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u5efa (24314) at line 1 column 2
   Currently matched the first 2 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u30d5 (12501) at line 1 column 3
   Currently matched the first 3 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u30a7 (12455) at line 1 column 4
   Currently matched the first 4 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u30f3 (12531) at line 1 column 5
   Currently matched the first 5 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u30b9 (12473) at line 1 column 6
   Currently matched the first 6 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u5de5 (24037) at line 1 column 7
   Currently matched the first 7 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
Current character : \u696d (26989) at line 1 column 8
   Currently matched the first 8 characters as a <WORD> token.
   Possible kinds of longer matches : { <WORD> }
****** FOUND A <WORD> MATCH (\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d) ******

Returning the <EOF> token.

OK.

As you can see, the generated tokenizer has no trouble treating \u696d as a LETTER.
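To look for the problem "elsewhere", it can help to check, outside JavaCC, which characters actually reach the lexer. The sketch below is not part of the original answer: it mirrors the LETTER ranges in plain Java, prints a character in the same hex and decimal form as the debug trace, and shows one possible upstream cause, namely the same UTF-8 bytes being decoded with a different charset. The question does not show how the input is read, so the decoding scenario is an assumption, and the class and method names (`LetterRangeCheck`, `isLetter`, `allLetters`, `describe`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class LetterRangeCheck {
    // Inclusive code unit ranges copied from the LETTER fragment in the grammar.
    private static final int[][] RANGES = {
        {0x0024, 0x0024}, {0x0041, 0x005a}, {0x005f, 0x005f}, {0x0061, 0x007a},
        {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x00ff}, {0x0100, 0x1fff},
        {0x3040, 0x318f}, {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}
    };

    // True if c falls inside any of the LETTER ranges.
    static boolean isLetter(char c) {
        for (int[] range : RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // True if every char of s is a LETTER, i.e. s would lex as a single WORD.
    static boolean allLetters(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isLetter(s.charAt(i))) {
                return false;
            }
        }
        return !s.isEmpty();
    }

    // Formats a char as hex plus decimal, similar to the token manager trace.
    static String describe(char c) {
        return String.format(Locale.ROOT, "U+%04X (%d)", (int) c, (int) c);
    }

    public static void main(String[] args) {
        // The test string, written with escapes so the result does not depend
        // on the encoding of this source file.
        String word = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in.
        String decodedAsUtf8 = new String(utf8, StandardCharsets.UTF_8);
        // Assumed failure mode: the same bytes decoded with another charset.
        String decodedAsLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(describe(word.charAt(7)));
        System.out.println(allLetters(decodedAsUtf8));
        System.out.println(allLetters(decodedAsLatin1));
    }
}
```

If the characters printed for the real input differ from those in the trace above, the input is being altered before it reaches the token manager. In that case, passing the parser a `Reader` constructed with an explicit charset, rather than relying on the platform default, is one thing to try.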

Original question on Stack Overflow: https://stackoverflow.com/questions/30933785/
