c++ - Identifier character set (clang)

Tags: c++ clang identifier

I have never used clang.

I accidentally discovered that this piece of code:

#include <iostream>

void функция(int переменная)
{
    std::cout << переменная << std::endl;
}

int main()
{
    int русская_переменная = 0;
    функция(русская_переменная);
}

compiles fine: http://rextester.com/NFXBL38644 (clang 3.4 (clang++ -Wall -std=c++11 -O2)).

Is this a clang extension? And why? Thanks.

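UPD: What I am really asking is why clang made this decision, because I have never found a discussion in which someone wanted more characters than the C++ standard currently allows (2.3, rev. 3691).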

Best answer

It is not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.

As to why, I guess "why not?" is the only real answer; supporting a larger character set seems useful and reasonable to me.

Here are the relevant parts of the standard (C11 draft):

5.2.1 Character sets

1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet

A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 lowercase letters of the Latin alphabet

a b c d e f g h i j k l m
n o p q r s t u v w x y z

the 10 decimal digits

0 1 2 3 4 5 6 7 8 9

the following 29 graphic characters

! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.

5 The universal character name construct provides a way to name other characters.
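The universal character name construct from paragraph 5 can also be used to spell an identifier. The following is a minimal sketch, not something taken from the original answer. It assumes a C++11 (or later) compiler that accepts universal character names in identifiers, as recent Clang and GCC releases do. The variable name (the Cyrillic letters "пер" written as escapes) and the helper `read_value` are made up for illustration.

```cpp
// Identifier written only with universal character names, so the source
// file itself stays within the basic source character set.
// The three escapes are the Cyrillic letters U+043F, U+0435, U+0440.
constexpr int \u043f\u0435\u0440 = 42;

// Hypothetical helper: reads the variable back through the same spelling.
constexpr int read_value()
{
    return \u043f\u0435\u0440;
}
```

Calling `read_value()` yields 42. In C++11, translation phase 1 maps every source character outside the basic source character set to its universal character name, so a compiler that reads the file as UTF-8 can treat the literal Cyrillic spelling as the same identifier.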

And also:

5.2.1.2 Multibyte characters

1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:

— The basic character set shall be present and each character shall be encoded as a single byte.

— The presence, meaning, and representation of any additional members is locale-specific.

— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

2 For source files, the following shall hold:

— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.

— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.
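To make the single-byte versus multibyte distinction concrete, here is a small sketch that is not part of the original answer. It assumes UTF-8 as the multibyte encoding, which is what `u8` string literals use, and `utf8_length` is a hypothetical helper name.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units (bytes for a u8 literal) in a
// string literal, not counting the terminating null character.
// CharT is deduced so this works whether u8 literals are char (C++11 to
// C++17) or char8_t (C++20).
template <typename CharT, std::size_t N>
constexpr std::size_t utf8_length(const CharT (&)[N])
{
    return N - 1;
}

// A member of the basic character set is encoded as a single byte.
static_assert(utf8_length(u8"f") == 1, "basic character: one byte");

// U+0444 (Cyrillic small letter ef) is an extended character and takes
// two bytes in UTF-8.
static_assert(utf8_length(u8"\u0444") == 2, "extended character: two bytes");
```

Both checks run at compile time, so the file compiling at all is the result. The escape `\u0444` keeps the example itself in plain ASCII; with a UTF-8 source file the literal Cyrillic letter could be written directly instead.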

Regarding c++ - Identifier character set (clang), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24469809/
