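I am trying to understand how a string literal is translated into its final String value (made up of code unit values), following ECMAScript 2017.
Relevant excerpts
5.1.2 The Lexical and RegExp Grammars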
A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.
Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.
5.1.4 The Syntactic Grammar
When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar.
and
11 ECMAScript Language: Lexical Grammar
The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.
11.8.4 String Literals
StringLiteral ::
    " DoubleStringCharacters_opt "
    ' SingleStringCharacters_opt '

SingleStringCharacters ::
    SingleStringCharacter SingleStringCharacters_opt

SingleStringCharacter ::
    SourceCharacter but not one of ' or \ or LineTerminator
    \ EscapeSequence
    LineContinuation

EscapeSequence ::
    CharacterEscapeSequence
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    UnicodeEscapeSequence

CharacterEscapeSequence ::
    SingleEscapeCharacter
    NonEscapeCharacter

NonEscapeCharacter ::
    SourceCharacter but not one of EscapeCharacter or LineTerminator

EscapeCharacter ::
    SingleEscapeCharacter
    DecimalDigit
    x
    u
11.8.4.3 Static Semantics: SV
A string literal stands for a value of the String type. The String value (SV) of the literal is described in terms of code unit values contributed by the various parts of the string literal.
and
The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
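Question
Let's assume we have the string literal 'b\ar'. I now want to follow the lexical grammar and the semantic rules above to turn the string literal into a sequence of code unit values.
1. b\ar is recognized as a CommonToken
2. b\ar is further recognized as a StringLiteral
3. Every SingleStringCharacter without a \ in front is translated to SourceCharacter
4. \a is recognized as \ EscapeSequence
5. Each SourceCharacter is any Unicode code point
The problem I have is that the StringLiteral input element is now:
SourceCharacter, \ SourceCharacter, SourceCharacter
There is no SV rule for \ SourceCharacter, only for \ EscapeSequence.
This makes me wonder whether I have the order wrong, or whether I misunderstand how the lexical and syntactic grammars are applied.
I am also confused about how the SV rules are applied at all, because they are defined on nonterminal symbols rather than on terminal symbols (which should be what is left after the lexical grammar has been applied).
Any help is greatly appreciated.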
Best answer
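Alright, assume we are working with the single token 'b\ar', which is what you have called the StringLiteral token. Applying the algorithm defined in 11.8.4.3 Static Semantics: SV, together with 10.1.1 Static Semantics: UTF16Encoding(cp), we follow the SV rules:
The SV of StringLiteral:: ' SingleStringCharacters ' is the SV of SingleStringCharacters.
So the quotes go away and we are left with the SingleStringCharacters part, e.g.
SV(b\ar)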
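The SV of SingleStringCharacters:: SingleStringCharacter SingleStringCharacters is a sequence of one or two code units that is the SV of SingleStringCharacter followed by all the code units in the SV of SingleStringCharacters in order.
This says "call SV on each SingleStringCharacter and append the results".
SV(b)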
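The SV of SingleStringCharacter:: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
The code point of b is \x0062, so the result here is essentially a single-item sequence of 16-bit code units containing \x0062.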
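The UTF16Encoding step used above can be sketched in TypeScript as follows. This is a minimal sketch of the algorithm described in 10.1.1, not code taken from the specification; the range check and the sample values are additions for illustration.

```typescript
// Sketch of UTF16Encoding(cp): maps one code point to one or two code units.
function UTF16Encoding(cp: number): number[] {
  // Added for illustration: reject anything that is not a Unicode code point.
  if (Math.floor(cp) !== cp || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + cp);
  }
  // Code points up to 0xFFFF are encoded as a single code unit.
  if (cp <= 0xffff) {
    return [cp];
  }
  // Anything above 0xFFFF becomes a lead/trail surrogate pair.
  const offset = cp - 0x10000;
  const lead = Math.floor(offset / 0x400) + 0xd800;
  const trail = (offset % 0x400) + 0xdc00;
  return [lead, trail];
}

console.log(UTF16Encoding(0x62));    // code point of "b": one code unit
console.log(UTF16Encoding(0x1f600)); // outside the BMP: two code units
```

For b the result is the one-element sequence [0x62]; the second call shows why the rule talks about "one or two" code units.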
SV(\a)
The SV of SingleStringCharacter:: \ EscapeSequence is the SV of the EscapeSequence.
That is SV(EscapeSequence), which here is SV(a) (without the \ prefix).
The SV of EscapeSequence:: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
SV(a)
The SV of CharacterEscapeSequence:: NonEscapeCharacter is the SV of the NonEscapeCharacter.
The SV of NonEscapeCharacter:: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
The code point of a is \x0061, so this produces a single-unit sequence containing only \x0061.
SV(r)
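Same steps as SV(b); this produces a single-unit sequence containing \x0072.
SV(b) + SV(\a) + SV(r)
Putting it back together, the value of the string is the sequence of UTF-16 code units [\x0062, \x0061, \x0072], and that sequence of code units results in bar.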
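The whole walk-through can be sketched as one small TypeScript function. This is a simplified sketch under stated assumptions, not a full implementation of 11.8.4.3: the name svOfSingleQuoted and the SINGLE_ESCAPES table are hypothetical, only single-quoted literals are accepted, and hex escapes, Unicode escapes, \0, legacy octal escapes and line continuations are rejected rather than handled. It also copies code units directly, which matches UTF16Encoding for well-formed input because a JavaScript string is already stored as UTF-16.

```typescript
// SingleEscapeCharacter and the code unit each one stands for.
const SINGLE_ESCAPES: { [ch: string]: number } = {
  "'": 0x27, '"': 0x22, "\\": 0x5c,
  b: 0x08, f: 0x0c, n: 0x0a, r: 0x0d, t: 0x09, v: 0x0b,
};

// Hypothetical helper: SV of a single-quoted literal given as source text.
function svOfSingleQuoted(literal: string): number[] {
  if (literal.length < 2 || literal.charAt(0) !== "'" ||
      literal.charAt(literal.length - 1) !== "'") {
    throw new SyntaxError("expected a single-quoted string literal");
  }
  // SV(StringLiteral) is SV(SingleStringCharacters): drop the quotes.
  const body = literal.slice(1, -1);
  const units: number[] = [];
  let i = 0;
  // SV(SingleStringCharacters): SV of each character, appended in order.
  while (i < body.length) {
    const ch = body.charAt(i);
    if (ch !== "\\") {
      if (/['\n\r\u2028\u2029]/.test(ch)) {
        throw new SyntaxError("character not allowed unescaped");
      }
      // SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator
      units.push(body.charCodeAt(i));
      i += 1;
      continue;
    }
    // SingleStringCharacter :: \ EscapeSequence
    const next = body.charAt(i + 1);
    if (next === "" || /[0-9xu\n\r\u2028\u2029]/.test(next)) {
      throw new SyntaxError("escape form not covered by this sketch");
    }
    const mapped = SINGLE_ESCAPES[next];
    // SingleEscapeCharacter uses the table; NonEscapeCharacter stands for itself.
    units.push(mapped !== undefined ? mapped : body.charCodeAt(i + 1));
    i += 2;
  }
  return units;
}

const units = svOfSingleQuoted("'b\\ar'"); // source text: 'b\ar'
console.log(units.map((u) => u.toString(16)), String.fromCharCode(...units));
```

Because a is not a SingleEscapeCharacter, \a falls through to the NonEscapeCharacter case and contributes the code unit of a itself.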
Edit:
I thought we should first apply the lexical grammar and end up with tokens, and then subsequently apply the SV rules?
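From the lexer's point of view, the "token" is the StringLiteral, and everything inside it is just information about how to parse it. EscapeSequence is not a kind of token. SV defines how a StringLiteral token is broken down into a sequence of code units. As stated in 11 ECMAScript Language: Lexical Grammar: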
The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.
These "input elements" are the tokens that the parser grammar uses.
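To make the "one token" point concrete, here is a minimal scanner sketch in TypeScript. The StringLiteralToken shape and the scanSingleQuoted name are hypothetical, and the scanner is a simplification: it only handles single-quoted literals and does not validate escapes or treat CRLF line continuations specially.

```typescript
// Hypothetical token shape: the lexer keeps the raw source text of the literal.
interface StringLiteralToken {
  type: "StringLiteral";
  raw: string; // source text including both quotes
  end: number; // index just past the closing quote
}

// Scan one single-quoted StringLiteral that starts at index `start`.
function scanSingleQuoted(source: string, start: number): StringLiteralToken {
  if (source.charAt(start) !== "'") {
    throw new SyntaxError("expected opening quote");
  }
  let i = start + 1;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === "\\") {
      // An escape is consumed as part of the same token, never emitted on its own.
      i += 2;
      continue;
    }
    if (ch === "'") {
      return { type: "StringLiteral", raw: source.slice(start, i + 1), end: i + 1 };
    }
    if (ch === "\n" || ch === "\r" || ch === "\u2028" || ch === "\u2029") {
      break; // an unescaped LineTerminator cannot appear inside the literal
    }
    i += 1;
  }
  throw new SyntaxError("unterminated string literal");
}

const token = scanSingleQuoted("x = 'b\\ar';", 4); // source text: x = 'b\ar';
console.log(token.type, token.raw);
```

The scanner yields a single StringLiteral token whose raw text still contains \a; turning that text into code units is the job of the SV rules, not of tokenization.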
Assuming the order of events is right, my second question is about SV(\a). The first escape sequence rule is applied and we are left with SV(a), which should follow the same path as SV(b), no?
It is not just about the value, it is also about the data type. Using Flow/TypeScript-style annotations, you can think of the steps above
The SV of SingleStringCharacter:: \ EscapeSequence is the SV of the EscapeSequence.
The SV of EscapeSequence:: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
The SV of CharacterEscapeSequence:: NonEscapeCharacter is the SV of the NonEscapeCharacter.
The SV of NonEscapeCharacter:: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
as if SV were an overloaded function, e.g.
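// Illustrative pseudocode: one "overload" per SV rule. Real TypeScript
// allows only one implementation body per function name.
function SV(parts: ["\\", EscapeSequence]) {
  // SingleStringCharacter :: \ EscapeSequence
  return SV(parts[1]);
}
function SV(parts: [CharacterEscapeSequence]) {
  // EscapeSequence :: CharacterEscapeSequence
  return SV(parts[0]);
}
function SV(parts: [NonEscapeCharacter]) {
  // CharacterEscapeSequence :: NonEscapeCharacter
  return SV(parts[0]);
}
function SV(parts: [SourceCharacter]) {
  // NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator
  return UTF16Encoding(parts[0]);
}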
So SV(a) is kind of like SV("a": [CharacterEscapeSequence]), whereas SV(b) has a different type.
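The same idea can be written as TypeScript that actually compiles by replacing the overloads with a tagged union. This is only a sketch: the node types and the names PlainCharacter and EscapedCharacter are hypothetical stand-ins for the two SingleStringCharacter alternatives, and only the productions needed for 'b\ar' are modelled.

```typescript
// Hypothetical parse-tree nodes, tagged with the production that matched.
type NonEscapeCharacter = { kind: "NonEscapeCharacter"; codePoint: number };
type CharacterEscapeSequence = { kind: "CharacterEscapeSequence"; child: NonEscapeCharacter };
type EscapeSequence = { kind: "EscapeSequence"; child: CharacterEscapeSequence };
// SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator
type PlainCharacter = { kind: "PlainCharacter"; codePoint: number };
// SingleStringCharacter :: \ EscapeSequence
type EscapedCharacter = { kind: "EscapedCharacter"; child: EscapeSequence };
type ParseNode =
  | NonEscapeCharacter
  | CharacterEscapeSequence
  | EscapeSequence
  | PlainCharacter
  | EscapedCharacter;

// Compact UTF16Encoding sketch; assumes cp is a valid code point.
function UTF16Encoding(cp: number): number[] {
  if (cp <= 0xffff) return [cp];
  const offset = cp - 0x10000;
  return [Math.floor(offset / 0x400) + 0xd800, (offset % 0x400) + 0xdc00];
}

// The rule that applies is chosen by the node's kind, not by its text.
function SV(node: ParseNode): number[] {
  switch (node.kind) {
    case "PlainCharacter":
    case "NonEscapeCharacter":
      return UTF16Encoding(node.codePoint);
    case "EscapedCharacter":
    case "EscapeSequence":
    case "CharacterEscapeSequence":
      return SV(node.child);
  }
}

const b: PlainCharacter = { kind: "PlainCharacter", codePoint: 0x62 };
const escapedA: EscapedCharacter = {
  kind: "EscapedCharacter",
  child: {
    kind: "EscapeSequence",
    child: {
      kind: "CharacterEscapeSequence",
      child: { kind: "NonEscapeCharacter", codePoint: 0x61 },
    },
  },
};
console.log(SV(b), SV(escapedA));
```

Both calls end in UTF16Encoding, but \a reaches it through three delegating rules while b reaches it directly, which is the sense in which the two have different types.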
Source: the Stack Overflow question "ECMAScript 2017 : Parsing from nonterminal StringLiteral to String values": https://stackoverflow.com/questions/49641927/