javascript - ECMAScript 2017 : Parsing from nonterminal StringLiteral to String values

Tags: javascript parsing ecmascript-6 lexical-analysis ecmascript-2017

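I'm trying to understand how a string literal is translated into its final String value (a sequence of code unit values), following ECMAScript 2017.

Relevant excerpts

5.1.2 The Lexical and RegExp Grammars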

A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.



5.1.4 The Syntactic Grammar

When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar.





11 ECMAScript Language: Lexical Grammar

The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.



11.8.4 String Literals
StringLiteral ::
    " DoubleStringCharacters_opt "
    ' SingleStringCharacters_opt '

SingleStringCharacters ::
    SingleStringCharacter SingleStringCharacters_opt

SingleStringCharacter ::
    SourceCharacter but not one of ' or \ or LineTerminator
    \ EscapeSequence
    LineContinuation

EscapeSequence ::
    CharacterEscapeSequence
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    UnicodeEscapeSequence

CharacterEscapeSequence ::
    SingleEscapeCharacter
    NonEscapeCharacter

NonEscapeCharacter ::
    SourceCharacter but not one of EscapeCharacter or LineTerminator

EscapeCharacter ::
    SingleEscapeCharacter
    DecimalDigit
    x
    u

11.8.4.3 Static Semantics: SV

A string literal stands for a value of the String type. The String value (SV) of the literal is described in terms of code unit values contributed by the various parts of the string literal.





The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.

The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.



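Question

Suppose we have the string literal 'b\ar'. I now want to follow the lexical grammar and the semantic rules above to turn the string literal into a set of code unit values.
  • b\ar is recognized as a CommonToken
  • b\ar is further recognized as a StringLiteral
  • The StringLiteral is translated into SingleStringCharacters
  • Each code point in SingleStringCharacters is translated into a SingleStringCharacter
  • Each SingleStringCharacter without a \ in front of it is translated into a SourceCharacter
  • \a is recognized as \ EscapeSequence
  • The EscapeSequence (a) is translated into a NonEscapeCharacter
  • The NonEscapeCharacter is translated into a SourceCharacter
  • Every SourceCharacter is translated into "any Unicode code point"
  • Finally, the SV rules are applied to get the String value and its code unit values

The problem I have is that the StringLiteral input element is now:

    SourceCharacter, \ SourceCharacter, SourceCharacter

There is no SV rule for \ SourceCharacter, only for \ EscapeSequence.

This makes me wonder whether I have the order wrong, or have misunderstood how the lexical and syntactic grammars are applied.

I'm also confused about how exactly the SV rules are applied, because they are defined on nonterminal symbols rather than on terminal symbols (which should be what is left after applying the lexical grammar).

Any help is deeply appreciated.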

Accepted answer

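Alright, let's assume we're working with the single token 'b\ar', which is what you've called the StringLiteral token. Applying the algorithm defined in 11.8.4.3 Static Semantics: SV, together with 10.1.1 Static Semantics: UTF16Encoding(cp), we follow the SV rules:

  • The SV of StringLiteral :: ' SingleStringCharacters ' is the SV of SingleStringCharacters.
    • Since we're running SV recursively, strip the quotes and continue with just the SingleStringCharacters part, i.e. SV(b\ar).
  • The SV of SingleStringCharacters :: SingleStringCharacter SingleStringCharacters is a sequence of one or two code units that is the SV of SingleStringCharacter followed by all the code units in the SV of SingleStringCharacters in order.
    • This says "call SV on each SingleStringCharacter and append the results".
  • SV(b)
    • The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
    • The code point "b" is the code unit \x0062, so the result here is a code unit sequence holding the single 16-bit unit \x0062.
  • SV(\a)
    • The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
      • Essentially SV(EscapeSequence), which here is SV(a) (without the \ prefix).
    • The SV of EscapeSequence :: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
      • Basically this just passes SV(a) through.
    • The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter.
      • More pass-through.
    • The SV of NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
      • The code point "a" is the code unit \x0061, so this produces a single-unit sequence containing only \x0061.
  • SV(r)
    • This follows the same steps as SV(b) and produces a single-unit sequence containing \x0072.
  • Joining the sequences SV(b) + SV(\a) + SV(r) back together, the value of the string is the sequence of UTF-16 code units [\x0062, \x0061, \x0072]. That code unit sequence is the string bar.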
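
The walk above can be sketched in TypeScript. This is only a rough illustration, not the spec's algorithm in full: the names `utf16Encoding` and `svOfSingleStringCharacters` are made up for this example, the input is assumed to be the text between the quotes of an already valid literal, and only plain source characters and `\` followed by a NonEscapeCharacter are handled. The other escapes (`\n`, `\x..`, `\u....`, `\0`, line continuations) are rejected rather than implemented.

```typescript
// UTF16Encoding(cp), per 10.1.1: one code unit inside the BMP, otherwise a surrogate pair.
function utf16Encoding(cp: number): number[] {
  if (cp <= 0xffff) return [cp];
  const offset = cp - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// EscapeCharacter :: SingleEscapeCharacter | DecimalDigit | x | u
const ESCAPE_CHARACTERS = "'\"\\bfnrtv0123456789xu";

// Hypothetical helper: the SV of the SingleStringCharacters between the quotes.
function svOfSingleStringCharacters(body: string): number[] {
  const units: number[] = [];
  const chars = Array.from(body); // split by code point, not by code unit
  for (let i = 0; i < chars.length; i++) {
    let ch = chars[i];
    if (ch === "\\") {
      // SingleStringCharacter :: \ EscapeSequence. Only the NonEscapeCharacter
      // case is covered here, so the SV is that of the character after the "\".
      ch = chars[++i];
      if (ch === undefined || ESCAPE_CHARACTERS.indexOf(ch) !== -1) {
        throw new Error("escape sequence not covered by this sketch");
      }
    }
    // SourceCharacter: contribute the UTF16Encoding of its code point.
    units.push(...utf16Encoding(ch.codePointAt(0)!));
  }
  return units;
}

const units = svOfSingleStringCharacters("b\\ar"); // the source text b\ar
console.log(String.fromCharCode(...units)); // prints "bar"
```

Note that `\a` and a plain `a` end up contributing the same code unit; they only differ in which productions were walked to get there.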

Edit:

    I thought we should first apply the lexical grammar and end up with tokens, and then subsequently apply the SV rules?



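From the lexer's point of view, the "token" is the StringLiteral, and everything inside it is just information about how to parse it. EscapeSequence is not a kind of token.
SV defines how a StringLiteral token is broken down into a sequence of code units.

As stated in 11 ECMAScript Language: Lexical Grammar: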

    The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.



Those "input elements" are the tokens that the parser grammar consumes.
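
To make the lexer's view concrete, here is a minimal sketch of scanning one single-quoted StringLiteral as a single token. The `Token` shape and the function `scanSingleQuotedLiteral` are hypothetical names for this example, not anything the spec defines. The sketch is simplified: it only checks `\n` and `\r` as line terminators, skips exactly one character after a backslash, and never interprets the escapes, because that is the job of SV afterwards.

```typescript
interface Token {
  type: "StringLiteral";
  text: string; // raw source text of the token, quotes included
  end: number;  // index just past the closing quote
}

// Hypothetical scanner for a single-quoted StringLiteral starting at `start`.
function scanSingleQuotedLiteral(source: string, start: number): Token {
  if (source[start] !== "'") throw new Error("not at a single-quoted literal");
  let i = start + 1;
  while (i < source.length) {
    const ch = source[i];
    if (ch === "\n" || ch === "\r") break; // a raw line terminator is not allowed
    if (ch === "'") {
      return { type: "StringLiteral", text: source.slice(start, i + 1), end: i + 1 };
    }
    // A backslash and the character after it stay together, so an escaped
    // quote does not end the token. The escape is not interpreted here.
    i += ch === "\\" ? 2 : 1;
  }
  throw new Error("unterminated string literal");
}

const token = scanSingleQuotedLiteral("x = 'b\\ar';", 4);
console.log(token.text); // prints 'b\ar' (one token, backslash still in place)
```

The scanner hands back the whole literal as one input element; `\a` never becomes a token of its own.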

    Assuming the order of events is right, my second question is around SV(\a). The first escape sequence rule is applied and we are left with SV(a), which should follow the same path as SV(b), no?



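It's not just about the value, but also about the data type. Using Flow/TypeScript-style annotations, you can think of the steps above

  • The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
  • The SV of EscapeSequence :: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
  • The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter.
  • The SV of NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.

as if SV were an overloaded function, e.g.

    // Conceptual pseudocode: one SV "overload" per production. Real TypeScript
    // allows only a single implementation per function name.
    function SV(parts: ["\\", EscapeSequence]) {
        return SV(parts[1]);
    }
    function SV(parts: [CharacterEscapeSequence]) {
        return SV(parts[0]);
    }
    function SV(parts: [NonEscapeCharacter]) {
        return SV(parts[0]);
    }
    function SV(parts: [SourceCharacter]) {
        return UTF16Encoding(parts[0]);
    }

So SV(a) is kind of like SV("a": [CharacterEscapeSequence]), while SV(b) has a different type.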
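
For anyone who wants to run that idea, one possible way to express the "overloads" in real TypeScript is a tagged union with a single `SV` function. Everything below is an assumption made for illustration: the node shapes, the `kind` tags, and the helper `toCodeUnits` are not from the spec, and only the NonEscapeCharacter branch of CharacterEscapeSequence is modelled.

```typescript
// Hypothetical parse-tree nodes; each `kind` names the production that matched.
type SourceCharacter = { kind: "SourceCharacter"; codePoint: number };
type NonEscapeCharacter = { kind: "NonEscapeCharacter"; child: SourceCharacter };
type CharacterEscapeSequence = { kind: "CharacterEscapeSequence"; child: NonEscapeCharacter };
type EscapeSequence = { kind: "EscapeSequence"; child: CharacterEscapeSequence };
type SingleStringCharacter = {
  kind: "SingleStringCharacter";
  child: SourceCharacter | EscapeSequence; // plain character, or what follows the "\"
};
type GrammarNode =
  | SourceCharacter
  | NonEscapeCharacter
  | CharacterEscapeSequence
  | EscapeSequence
  | SingleStringCharacter;

// Stand-in for UTF16Encoding(cp): one unit in the BMP, else a surrogate pair.
function toCodeUnits(cp: number): number[] {
  if (cp <= 0xffff) return [cp];
  const offset = cp - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// One SV, dispatching on the node's type: every wrapper production passes
// through to its child, and only SourceCharacter produces code units.
function SV(node: GrammarNode): number[] {
  if (node.kind === "SourceCharacter") return toCodeUnits(node.codePoint);
  return SV(node.child);
}

// b is a SingleStringCharacter wrapping a SourceCharacter directly.
const plainB: SingleStringCharacter = {
  kind: "SingleStringCharacter",
  child: { kind: "SourceCharacter", codePoint: 0x62 },
};

// \a wraps the SourceCharacter "a" in three more layers of productions.
const escapedA: SingleStringCharacter = {
  kind: "SingleStringCharacter",
  child: {
    kind: "EscapeSequence",
    child: {
      kind: "CharacterEscapeSequence",
      child: {
        kind: "NonEscapeCharacter",
        child: { kind: "SourceCharacter", codePoint: 0x61 },
      },
    },
  },
};

console.log(SV(plainB), SV(escapedA)); // single-unit sequences for 0x62 and 0x61
```

Both calls bottom out in the same SourceCharacter case; the difference between `b` and `\a` lies entirely in the type of node that SV starts from.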

Source: Stack Overflow, https://stackoverflow.com/questions/49641927/
