parsing - Matching a specific number of repetitions non-greedily in ANTLR

In my grammar, I have something like this:

line : startWord (matchPhrase|
                  anyWord matchPhrase|
                  anyWord anyWord matchPhrase|
                  anyWord anyWord anyWord matchPhrase|
                  anyWord anyWord anyWord anyWord matchPhrase) 
       -> ^(TreeParent startWord anyWord* matchPhrase);

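So I want to match the first occurrence of matchPhrase, while allowing at most a fixed number of anyWord matches before it. The tokens that make up matchPhrase are also matched by anyWord.

Is there a better way to do this?

I think it might be possible by combining the semantic predicate described in another answer with the non-greedy option: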

(options {greedy=false;} : anyWord)*

But I can't figure out exactly how to do this.

Edit: Here is an example. I want to extract information from the following sentences:

Picture of a red flower.

Picture of the following: A red flower.

My input is actually tagged English sentences, and the lexer rules match the tags rather than the words. So the input to ANTLR is:

NN-PICTURE Picture IN-OF of DT a JJ-COLOR red NN-FLOWER flower

NN-PICTURE Picture IN-OF of DT the VBG following COLON : DT a JJ-COLOR red NN-FLOWER flower

I have lexer rules and a rule for each tag, like this:

WS :  (' ')+ {skip();};
TOKEN : (~' ')+;

nnpicture:'NN-PICTURE' TOKEN -> ^('NN-PICTURE' TOKEN);
vbg:'VBG' TOKEN -> ^('VBG' TOKEN);

My parser rules look like this:

sentence : nnpicture inof matchFlower;

matchFlower : (dtTHE|dt)? jjcolor? nnflower;

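However, this of course fails on the second sentence. So I want to allow a little flexibility by permitting up to N tokens before the flower match. I have an anyWord token that matches anything, and the following works: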

sentence :  nnpicture inof ( matchFlower | 
                             anyWord matchFlower |
                             anyWord anyWord matchFlower | etc.

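But it isn't very elegant, and it doesn't scale to large N.

Best answer

You can do this by first checking, inside the matchFlower rule, whether there really is a dt? jjcolor? nnflower ahead in the token stream, using a syntactic predicate. If such tokens can be seen, simply match them; if not, match any single token and recursively match matchFlower. That would look like: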

matchFlower
 : (dt? jjcolor? nnflower)=> dt? jjcolor? nnflower -> ^(FLOWER dt? jjcolor? nnflower)
 |                           . matchFlower         -> matchFlower
 ;

Note that inside a parser rule, the . (dot) does not match any character; it matches any token.

Here is a quick demo:
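The control flow of that rule (look ahead first, otherwise consume one token and recurse) can be sketched outside ANTLR. The Python below is only an illustration under stated assumptions, not what ANTLR generates: the names try_flower and match_flower are hypothetical, and each input item is simplified to a (tag, word) pair rather than the separate tag and word tokens the real lexer emits.

```python
from typing import List, Optional, Tuple

# Simplifying assumption: one (tag, word) pair per tagged word,
# e.g. ("JJ-COLOR", "red").
Token = Tuple[str, str]


def try_flower(tokens: List[Token], pos: int) -> Optional[int]:
    """Plays the role of the syntactic predicate: check whether
    dt? jjcolor? nnflower starts at pos without consuming anything.
    Returns the index just past the match, or None."""
    i = pos
    if i < len(tokens) and tokens[i][0] == "DT":
        i += 1  # optional determiner
    if i < len(tokens) and tokens[i][0] == "JJ-COLOR":
        i += 1  # optional color adjective
    if i < len(tokens) and tokens[i][0] == "NN-FLOWER":
        return i + 1  # mandatory flower noun
    return None


def match_flower(tokens: List[Token], pos: int = 0) -> Optional[List[Token]]:
    """First alternative if the lookahead succeeds; otherwise skip
    one token (the '.' alternative) and recurse."""
    if pos >= len(tokens):
        return None  # input exhausted without finding a flower
    end = try_flower(tokens, pos)
    if end is not None:
        return tokens[pos:end]
    return match_flower(tokens, pos + 1)
```

Called on the tagged words that follow IN-OF in the second sentence, match_flower skips DT the, VBG following and COLON :, and returns the three pairs for "a red flower". Note that the lookahead fails at DT the because no NN-FLOWER follows it, which is why the predicate has to cover the whole phrase and not just its first token.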

grammar T;

options {
  output=AST;
}

tokens {
  TEXT;
  SENTENCE;
  FLOWER;
}

parse
 : sentence+ EOF -> ^(TEXT sentence+)
 ;

sentence
 : nnpicture inof matchFlower -> ^(SENTENCE nnpicture inof matchFlower)
 ;

nnpicture
 : NN_PICTURE TOKEN -> ^(NN_PICTURE TOKEN)
 ;

matchFlower
 : (dt? jjcolor? nnflower)=> dt? jjcolor? nnflower -> ^(FLOWER dt? jjcolor? nnflower)
 |                           . matchFlower         -> matchFlower
 ;

inof
 : IN_OF (t=IN | t=OF) -> ^(IN_OF $t)
 ;

dt
 : DT (t=THE | t=A) -> ^(DT $t)
 ;

jjcolor
 : JJ_COLOR TOKEN -> ^(JJ_COLOR TOKEN)
 ;

nnflower
 : NN_FLOWER TOKEN -> ^(NN_FLOWER TOKEN)
 ;

IN_OF      : 'IN-OF';
NN_FLOWER  : 'NN-FLOWER';
DT         : 'DT';
A          : 'a';
THE        : 'the';
IN         : 'in';
OF         : 'of';
VBG        : 'VBG';
NN_PICTURE : 'NN-PICTURE';
JJ_COLOR   : 'JJ-COLOR';
TOKEN      : ~' '+;
WS         : ' '+ {skip();};

The parser generated from the grammar above will parse your input:

NN-PICTURE Picture IN-OF of DT the VBG following COLON : DT a JJ-COLOR red NN-FLOWER flower

as follows:

[image: the resulting AST]

As you can see, everything before the flower is omitted from the tree. If you want to keep those tokens in there, do something like this:

grammar T;

// ...

tokens {
  // ...
  NOISE;
}

// ...

matchFlower
 : (dt? jjcolor? nnflower)=> dt? jjcolor? nnflower -> ^(FLOWER dt? jjcolor? nnflower)
 |                           t=. matchFlower       -> ^(NOISE $t) matchFlower
 ;

// ...

which produces the following AST:

[image: the resulting AST, including the NOISE nodes]
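The grammar above skips any number of tokens before the flower, whereas the question asked for at most N. One way to think about adding that bound is to count the skipped tokens and give up once the limit is reached. The Python below is a rough sketch of that idea only, not ANTLR code: match_flower_bounded, flower_end and max_noise are hypothetical names, and the input is again simplified to (tag, word) pairs.

```python
from typing import List, Optional, Tuple

# Simplifying assumption: one (tag, word) pair per tagged word.
Token = Tuple[str, str]


def flower_end(tokens: List[Token], pos: int) -> Optional[int]:
    """Lookahead for dt? jjcolor? nnflower starting at pos.
    Returns the index just past the match, or None."""
    i = pos
    for optional_tag in ("DT", "JJ-COLOR"):
        if i < len(tokens) and tokens[i][0] == optional_tag:
            i += 1
    if i < len(tokens) and tokens[i][0] == "NN-FLOWER":
        return i + 1
    return None


def match_flower_bounded(
    tokens: List[Token], max_noise: int
) -> Optional[Tuple[List[Token], List[Token]]]:
    """Return (noise, flower) for the first flower phrase that is
    preceded by at most max_noise skipped tokens, else None."""
    noise: List[Token] = []
    pos = 0
    while pos < len(tokens):
        end = flower_end(tokens, pos)
        if end is not None:
            return noise, tokens[pos:end]
        if len(noise) == max_noise:
            return None  # skipping another token would exceed the limit
        noise.append(tokens[pos])  # corresponds to a NOISE node
        pos += 1
    return None
```

With the second example sentence, a limit of 3 succeeds with three noise tokens, while a limit of 2 fails. In the grammar itself, a comparable effect would need some form of counter, for example a rule parameter checked by a semantic predicate; that variant is not shown or tested here.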

Regarding parsing - Matching a specific number of repetitions non-greedily in ANTLR, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9689974/
