javascript - Why is the dollar sign no longer "intended for use only in mechanically generated code?"

In ECMA-262, 3rd edition [PDF], under section 7.6 ("Identifiers", page 26), we see the following note:

The dollar sign is intended for use only in mechanically generated code.

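This seems reasonable. Many languages commonly used to generate or embed JavaScript give $ a special meaning, and using it in JavaScript identifiers inside those languages can lead to unexpected behavior.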
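To make the concern concrete, here is a minimal sketch of how a generator that gives `$` its own meaning can collide with hand-written identifiers. The `renderTemplate` helper and its `$name` placeholder syntax are hypothetical, invented for this illustration; they are not taken from the specification or from any particular tool.

```javascript
// Hypothetical code generator: replaces "$name" placeholders with values.
// The placeholder syntax is an assumption made up for this sketch.
function renderTemplate(template, values) {
  return template.replace(/\$([A-Za-z_]\w*)/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(values, name)
      ? String(values[name])
      : match
  );
}

// "$limit" is meant as a placeholder, but "$total" is a perfectly legal
// JavaScript identifier that the script author chose by hand.
const template =
  "var $total = 0; for (var i = 0; i < $limit; i++) { $total += i; }";

// The generator happens to know a value called "total" as well,
// so the hand-written identifier is silently rewritten too.
const generated = renderTemplate(template, { limit: 10, total: 99 });
console.log(generated);

// Check whether the generated text still parses as JavaScript.
let parseError = null;
try {
  new Function(generated);
} catch (err) {
  parseError = err;
}
console.log(parseError instanceof SyntaxError);
```

Reserving `$` for generated code would avoid this kind of clash by convention: hand-written identifiers would never look like placeholders.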

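The "mechanically generated" clause first appears in the 2nd edition; it is not present in the 1st edition. As of the 5th edition it disappears again without any explanation, and it is still absent from the working draft of the 6th edition.

If I had to guess, I would say it was originally omitted because the potential pitfalls had not been considered, and then added in the next edition once it became clear that it would cause problems. However, I can't think of a good reason for removing it again in the 5th edition.

Is there any explanation (a "paper trail" from mailing lists, newsgroups, or elsewhere) for the inclusion and subsequent removal of the "mechanically generated" clause in the specification? I can't find a record of this anywhere.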

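As a side question, can anyone explain the rationale behind including zero-width characters in the 6th edition draft? This seems likely to cause more trouble, since you can't see these characters at all, and I can't think of any reason why you would want them in an identifier.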
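The visibility problem can be reproduced directly. The following is a minimal sketch, assuming an engine that implements ES5 or later; the identifier names are arbitrary, and the source text is built from strings so that the invisible character is explicit.

```javascript
const ZWNJ = "\u200C"; // ZERO WIDTH NON-JOINER, invisible when rendered

const plainName = "foobar";
const hiddenName = "foo" + ZWNJ + "bar"; // renders like "foobar"

console.log(plainName === hiddenName); // false
console.log(plainName.length, hiddenName.length); // 6 7

// Declare both names as variables in the same function body.
// They are two separate bindings, not a redeclaration.
const zwnjSource =
  "var " + plainName + " = 1; var " + hiddenName + " = 2; " +
  "return [" + plainName + ", " + hiddenName + "];";
const bindings = new Function(zwnjSource)();
console.log(bindings);

// ZWNJ is only permitted after the first character of an identifier.
let leadingError = null;
try {
  new Function("var " + ZWNJ + "foo = 1;");
} catch (err) {
  leadingError = err;
}
console.log(leadingError instanceof SyntaxError);
```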

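Update: The original inclusion of the "mechanically generated code" note and the inclusion of zero-width characters are both explained in codewaggle's answer below. The only thing left to answer is the main focus of this question: the removal of the "mechanically generated code" note.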

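Best answer

Here's a start: Subject: SC22 N2745 - Disposition of Comments Report on DIS 16262 -ECMAScript

It appears that "should only be used for mechanically-generated code" was added because that is what the Java specification says.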

D6) 7.5: DOLLAR SIGN should not be in the identifier list, according to recommendations in TR 10176. 7.5 should refer to the "i18n" specification of ISO/IEC 14652 for definitions of letters and digits.

>>>>>> Action: Partial acceptance --- ECMAScript follows Java precedent. A comment will add that $ should only be used for mechanically-generated code. <<<<<

If you want to peruse the notes from past meetings, you can look here:
ecmascript wiki: Notes and Minutes from past meetings

Regarding the subsequent changes:
All of these come from the mailing list "es5-discuss -- Discussion of ECMAScript 3.x".

ZWNJ and ZWJ in identifiers (was: Comments on April ES5 final draft standard tc39-2009-025)

John Cowan wrote:

It turns out that Unicode 5.1 has done the heavy lifting: the bad news is that the lifting is indeed heavy. You want to allow Cf characters if and only if they actually make a semantic distinction in contemporary use. That turns out, says Unicode 5.1, to allow only U+200C and U+200D and then only in certain contexts: the rules involve knowing the Script and Joining_Type properties of nearby identifier characters. Details at http://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters .

David-Sarah Hopwood replied:

What is the down-side of simply adding U+200C and U+200D to IdentifierPart without any additional context-sensitive rules?

I think that it is the combined responsibility of input methods and of programmers to ensure that <ZWNJ> and <ZWJ> characters are used as intended in identifiers; all that a programming language syntax needs to do is to allow them.

Note that the goal of "excluding as many cases as possible where no visible distinction results" (supposedly for security reasons) is not really applicable, since ECMAScript does not enforce even NFC normalization. To not enforce NFC but to add considerable complexity to the grammar, as UTR #31 suggests, in order to prevent some potential (but relatively harmless, AFAICS) misuses of <ZWNJ> and <ZWJ>, seems like an inconsistent set of design choices to me.
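The remark about NFC can be checked in any engine that supports `String.prototype.normalize` (ES2015 or later). This is a small sketch with arbitrary identifier names: two spellings of the same visible word are treated as different identifiers because the source text is not normalized before comparison.

```javascript
const composed = "caf\u00E9";    // "é" as one precomposed code point (NFC form)
const decomposed = "cafe\u0301"; // "e" followed by COMBINING ACUTE ACCENT

console.log(composed === decomposed); // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true

// Used as identifiers, the two spellings name two separate variables.
const nfcSource =
  "var " + composed + " = 1; var " + decomposed + " = 2; " +
  "return [" + composed + ", " + decomposed + "];";
const nfcValues = new Function(nfcSource)();
console.log(nfcValues);
```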

This one pulls a lot of the discussion together: Last call for consensus on format-control char. issues

There are 15 replies to it, which you may want to read:
https://mail.mozilla.org/pipermail/es5-discuss/2009-June/thread.html#2832

Allen Wirfs-Brock wrote:

Waldemar's notes from the May F2F don't record any decision on the issue of <ZWNJ> and <ZWJ> in identifiers. However, my personal notes say that I need to "keep in identifiers and fix grammar" which is also my recollection of what we decided at the meeting.

The simplest implementation of that decisions is to simply add <ZWNJ> and <ZWJ> as alternatives for IdentifierPart. In addition, the text in section 7.1 that says that format control characters can occur in identifier presumably needs to be narrowed to say only <ZWNJ> and <ZWJ>.

At about the same time as the F2F David-Sarah made a more comprehensive proposal (duplicated below) that in addition to addressing <ZWNJ> and <ZWJ> also significantly refines the rules for <BOM> including excluding them from strings literals and regular expressions and making it a syntax error for a <BOM> to appear within an identifier.

I'm not a Unicode expert, but my sense is that David-Sarah's proposal is sound and probably consistent with the original goals of cleaning up class Cf in the specification. However, his rules for <BOM> also seem like they could significantly complicate the lexical analysis phase of implementations.

My sense from the F2F is that the consensus was more in the direction of my simple solution above (<ZWNJ> and <ZWJ> in identifiers, <BOM> is whitespace) rather than David-Sarah's more comprehensive treatment of <BOM>.

I need to have a final decision on this so I can update the draft accordingly. Based upon my recollection of the F2F I'm going to go with the "simple solution" unless there is apparent consensus otherwise.

Final thoughts?
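The published 5th edition lists <BOM> among the WhiteSpace characters, which matches the "simple solution" described above. The sketch below shows what that means in practice, assuming an ES5-or-later engine; the variable names are arbitrary.

```javascript
const BOM = "\uFEFF"; // BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE

// <BOM> counts as white space, so the \s character class matches it.
const bomIsWhitespace = /\s/.test(BOM);
console.log(bomIsWhitespace);

// Between tokens it separates them like a space would.
const returned = new Function("return" + BOM + "1;")();
console.log(returned);

// Inside a string literal it is kept as an ordinary character.
const literalLength = new Function('return "' + BOM + '".length;')();
console.log(literalLength);

// Inside what looks like one identifier it splits the name into two tokens,
// so "var foo<BOM>bar = 1;" parses like "var foo bar = 1;" and fails.
let splitError = null;
try {
  new Function("var foo" + BOM + "bar = 1;");
} catch (err) {
  splitError = err;
}
console.log(splitError instanceof SyntaxError);
```

Under David-Sarah Hopwood's more comprehensive proposal, the string-literal case would have been a syntax error as well; that part was not adopted.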

The message he was replying to, split into several blocks according to its quoting:

-----Original Message-----
From: es5-discuss-bounces at mozilla.org [mailto:es5-discuss-bounces at mozilla.org] On Behalf Of David-Sarah Hopwood
Sent: Thursday, May 28, 2009 5:44 PM
To: es5-discuss at mozilla.org
Subject: Grammar for IdentifierName does not allow <ZWNJ> and <ZWJ>

John Cowan wrote:
David-Sarah Hopwood scripsit:

The omission of format-control characters from <IdentifierName> appears to be just an oversight.

-1

(break)

Indeed, I had forgotten that we had already discussed this and come to a different conclusion:

https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002432.html https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html.

(break)
Allowing all of them causes the same kinds of problems as allowing BOM. Most of them have little visible effect on the surrounding text (especially Latin-script text) even in fully conformant Unicode renderers, never mind renderers that muffle them. The result is that "foobar" and "foo<Cf>bar" look the same but aren't.

Per Unicode 5.1, the only ones that actually affect the natural-language meaning of identifiers are U+200C ZWNJ and U+200D ZWJ. These are the only ones which should even be considered in ES5 identifiers. UAX #31 (which is included by reference in Unicode 5.1) specifies narrower conditions in which ZWNJ and ZWJ are essential; sticking to the conditions is non-trivial, but minimizes the chance of spoofing.

Given the risks, I'm uncertain whether ZWNJ and ZWJ should be allowed or not.

(break)
Forget trying to minimize identifier spoofing as a security risk. That's not possible, if Unicode identifiers are to be allowed at all. It is an inherent characteristic of Unicode that many distinct (even when normalized) strings will look the same. It is not at all clear that this is a genuine security risk for general programming -- as opposed to situations that require adversarial code review, which full ECMAScript is a long way from being able to support.

What is useful to attempt to minimize is the chance of accidentally typing identifiers that are distinct but look the same, or of seeing an identifier and being unable to reliably reproduce it. This is a usability issue, not a security issue.

For usability, it may indeed be a good approach to allow <ZWNJ> and <ZWJ> but disallow other format-control characters. I am not sufficiently familiar with the scripts that require these characters to be sure of that, but it seems reasonable based on their descriptions in the Unicode standard.

However, the complicated script-dependent rules described in UAX #31 for restricting the contexts in which <ZWNJ> and <ZWJ> can occur, seem quite over-the-top given the impossibility of preventing spoofing. Again, see https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html.

Combining the proposal from that post with the changes for <NEL>, <ZWSP> and <BOM> (since both affect section 7.1), we end up with this.

====
Changes to section 7.2:
- revert the addition of <NEL>, <ZWSP>, and <BOM> to WhiteSpace and to the table.

Changes to section 7.8.4:

DoubleStringCharacter ::
    SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM>
    \ EscapeSequence
    LineContinuation

SingleStringCharacter ::
    SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM>
    \ EscapeSequence
    LineContinuation

NonEscapeCharacter ::
    SourceCharacter but not EscapeCharacter or LineTerminator or <BOM>

The CV of DoubleStringCharacter :: SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM> is the SourceCharacter character itself

The CV of SingleStringCharacter :: SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM> is the SourceCharacter character itself.

The CV of NonEscapeCharacter :: SourceCharacter but not EscapeCharacter or LineTerminator or <BOM> is the SourceCharacter character itself.

Replace section 7.1:

7.1 Unicode Format-Control Characters

The Unicode format-control characters (i.e., the characters in General Category "Cf" in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this, such as mark-up languages.

<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files.

In ECMAScript source, <BOM> characters are ignored if they appear immediately before or after a token, or within a span of consecutive WhiteSpace characters (7.2). The lexical grammar does not explicitly include such ignored <BOM> characters. It is a syntax error for a <BOM> character to appear within a token (that is, if removing the <BOM> would result in the preceding and following characters being part of the same token).

Note that comments are not tokens, and so the above rule allows <BOM> characters to appear within comments. It does not allow them to appear within string literals or regular expression literals (the escape sequence \uFEFF should be used instead).

It is useful to allow other format-control characters in source text to facilitate editing and display. Format-control characters other than <BOM> may be used within comments, string literals, and regular expression literals. Two specific format-control characters, <ZWNJ> and <ZWJ>, may also be used in an identifier after the first character.

  Code Unit Value    Name                                Formal name
  \u200C             Zero width non-joiner               <ZWNJ>
  \u200D             Zero width joiner                   <ZWJ>
  \uFEFF             Byte order mark (also called
                       zero-width non-breaking space)    <BOM>

Changes to section 7.6:

[...] This standard specifies specific character additions: The dollar sign ($) and the underscore (_) are permitted anywhere in an identifier. <ZWNJ> and <ZWJ> are permitted after the first character.

Changes to section 7.8.5:

RegularExpressionNonTerminator :: SourceCharacter but not LineTerminator or <BOM>

Changes to Annex A:
- update all productions changed above.

Changes to Annex E:
- add to the entry for section 7.1: characters are ignored between tokens and in comments, but are not allowed within tokens (including string and regular expression literals). <ZWNJ> and <ZWJ> are significant within identifiers rather than being stripped.
- delete the entries for sections 7.2 and 15.10.2.12.

(Reverting the additions of <NEL>, <ZWSP>, and <BOM> to the WhiteSpace production also reverts this for the \s character class, without any explicit change to section 15.10.2.12.)

-- David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com

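es5-discuss mailing list es5-discuss at mozilla.org https://mail.mozilla.org/listinfo/es5-discuss

I'm not going to try to pull all of this together and give you a concise answer. Maybe someone else will, and you can accept that answer. Please consider this a starting point.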

One last link:
The August 2009 archive has the initial draft and release candidate 1 discussions for ES5.

A similar question about javascript - Why is the dollar sign no longer "intended for use only in mechanically generated code?" can be found on Stack Overflow: https://stackoverflow.com/questions/16454175/
