grammar - Does .parse anchor or :sigspace first in a Perl 6 rule?

Tags: grammar raku

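I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?

I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace: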

grammar Number {
    rule TOP { \d+ }
    }

my @strings = '137', '137 ', ' 137 ';

for @strings -> $string {
    my $result = Number.parse( $string );
    given $result {
        when Match { put "<$string> worked!" }
        when Any   { put "<$string> failed!" }
        }
    }

With no whitespace or with trailing whitespace only, the string parses. With leading whitespace, it fails:

<137> worked!
<137 > worked!
< 137 > failed!

I figure this means that the rule applies :sigspace first and the anchors afterward:

grammar Foo {
    regex TOP { ^ :sigspace \d+ $ }
    }

I expected a rule to allow leading whitespace, which is what would happen if you switched the order:

grammar Foo {
    regex TOP { :sigspace ^  \d+ $ }
    }

I could add an explicit beginning-of-string anchor to the rule:

grammar Number {
    rule TOP { ^ \d+ }
    }

Now everything works:

<137> worked!
<137 > worked!
< 137 > worked!

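I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but they don't say in which order these effects apply: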

Note that if you're parsing with .parse method, token TOP is automatically anchored

When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.


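I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works: the cursor has to start at position 0 and end at the last position in the string. That is something outside of the pattern.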
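A minimal sketch of that idea, assuming the default ws token. It compares .parse with .subparse, which also starts at position 0 but does not have to reach the end of the string. The test strings are made up for illustration:

```raku
# The grammar from above, repeated so the sketch is self-contained.
grammar Number {
    rule TOP { \d+ }
}

# .parse must start at position 0 AND reach the end of the string.
say so Number.parse('137 abc');     # False: 'abc' is left over

# .subparse must still start at position 0, but may stop early.
say so Number.subparse('137 abc');  # True
say so Number.subparse(' 137');     # False: \d+ cannot match at position 0
```

If the anchoring were compiled into TOP itself, the two methods could not differ like this for the same rule.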

Best answer

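The behavior is intended, and is the culmination of these language features:

  • Sigspace ignores whitespace before the first atom.

    From the design docs¹ (S05: Regexes and Rules, line 348, emphasis added):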

    The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.

    This means that:

    rule TOP { \d+ }
                  ^-------- <.ws> automatically inserted
    
    rule TOP { ^ \d+ $ }
                ^---^-^---- <.ws> automatically inserted
    
  • Regexes are first-class compiled code with lexical scoping.

    A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.

    Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 629¹:

    The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)

  • The anchoring of rule TOP is done at run time by .parse.

    S05, line 4423¹:

    The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)

    I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.

    It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
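The three points above can be checked with a short sketch. The `explicit` and `padded` rules are hypothetical names added for illustration; `explicit` spells out the `<.ws>` call that sigspace is described as inserting, and both are selected with the `:rule` named argument of .parse:

```raku
grammar Number {
    rule  TOP      { \d+ }          # <.ws> implied after \d+ only
    token explicit { \d+ <.ws> }    # hand-written equivalent of TOP
    rule  padded   { ^ \d+ }        # <.ws> implied after ^ and after \d+
}

for '137', '137 ', ' 137 ' -> $string {
    my $top      = so Number.parse($string);
    my $explicit = so Number.parse($string, :rule<explicit>);
    my $padded   = so Number.parse($string, :rule<padded>);
    put "<$string> TOP: $top, explicit: $explicit, padded: $padded";
}
```

With the default ws token, TOP and explicit should agree on every string, and only padded should accept the string with leading whitespace.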

So when you write

rule TOP { \d+ }

it is compiled as

regex TOP { :r \d+ <.ws> }

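And when you .parse the grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but of the scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as: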

regex TOP { ^ [:r :s \d+ ] $ }
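As a rough check, the combined form can be tried as a plain regex outside any grammar. This sketch assumes the default ws behavior, and it is written with a space before the closing bracket so that sigspace inserts `<.ws>` after `\d+`. The variable name `$combined` is made up for illustration:

```raku
# Anchors sit outside the group; :r and :s apply only inside it.
my $combined = / ^ [:r :s \d+ ] $ /;

for '137', '137 ', ' 137 ' -> $string {
    put "<$string> ", ($string ~~ $combined ?? 'worked!' !! 'failed!');
}
```

This should reproduce the results of the original rule TOP { \d+ }: the first two strings match, and the one with leading whitespace does not.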

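1) In general, the design docs should not be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is fairly accurate in this regard, except that it mentions some features that are not yet implemented but planned. Anyone who wants to truly understand the intricacies of Perl 6 regexes/grammars would, in my opinion, do well to read the full S05 from top to bottom at least once.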

Regarding "grammar - Does .parse anchor or :sigspace first in a Perl 6 rule?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41626627/
