python - Identifier normalization: Why is the micro sign converted into the Greek letter mu?

I just stumbled upon the following odd situation:

>>> class Test:
        µ = 'foo'

>>> Test.µ
'foo'
>>> getattr(Test, 'µ')
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    getattr(Test, 'µ')
AttributeError: type object 'Test' has no attribute 'µ'
>>> 'µ'.encode(), dir(Test)[-1].encode()
(b'\xc2\xb5', b'\xce\xbc')

The character I typed is always the µ sign on my keyboard, but for some reason it gets converted. Why does this happen?

Best answer

There are two different characters involved here. One is MICRO SIGN (U+00B5), which is the one on the keyboard, and the other is GREEK SMALL LETTER MU (U+03BC).

To understand what is going on, we should look at how Python defines identifiers in the language reference:

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">

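Both of our characters, MICRO SIGN and GREEK SMALL LETTER MU, belong to the Unicode general category Ll (lowercase letters), so both can be used at any position in an identifier. Now note that the definition of identifier actually refers to xid_start and xid_continue, which are defined as all characters of the respective non-x definition whose NFKC normalization yields a valid character sequence for an identifier.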
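The category claim is easy to check with the standard unicodedata module. The following is a minimal sketch; the constant names MICRO_SIGN and GREEK_SMALL_MU are chosen here for illustration and are not part of the original answer.

```python
import unicodedata

# Escapes are used so the two look-alike characters cannot be confused.
MICRO_SIGN = "\u00b5"       # the key found on many keyboards
GREEK_SMALL_MU = "\u03bc"   # the Greek letter

for ch in (MICRO_SIGN, GREEK_SMALL_MU):
    # name() and category() query the Unicode character database;
    # str.isidentifier() reports whether the string is a valid identifier.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), ch.isidentifier())
```

Both lines should report category Ll and a valid identifier, even though the two strings are not equal to each other.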

Python apparently only cares about the normalized form of identifiers. This is confirmed a bit further down:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

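NFKC is a Unicode normalization form that replaces characters with their compatibility decomposition and then recomposes canonically equivalent sequences. MICRO SIGN has a compatibility decomposition to GREEK SMALL LETTER MU, and that is exactly what is happening here.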
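The mapping can be observed directly with unicodedata.normalize. This is a small sketch; the variable names are illustrative only.

```python
import unicodedata

micro = "\u00b5"  # MICRO SIGN

# NFKC applies the compatibility mapping, which turns the micro sign into mu.
normalized = unicodedata.normalize("NFKC", micro)
print(unicodedata.name(normalized))        # GREEK SMALL LETTER MU
print(micro == normalized)                 # False

# The raw decomposition entry from the Unicode database for MICRO SIGN.
print(unicodedata.decomposition(micro))    # <compat> 03BC

# The canonical form NFC leaves the micro sign alone; only the
# compatibility forms (NFKC, NFKD) replace it.
print(unicodedata.normalize("NFC", micro) == micro)  # True
```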

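There are many other characters that are affected by this normalization as well. Another example is OHM SIGN, which decomposes into GREEK CAPITAL LETTER OMEGA. Using it as an identifier gives a similar result, shown here using locals: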

>>> Ω = 'bar'
>>> locals()['Ω']
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    locals()['Ω']
KeyError: 'Ω'
>>> [k for k, v in locals().items() if v == 'bar'][0].encode()
b'\xce\xa9'
>>> 'Ω'.encode()
b'\xe2\x84\xa6'

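So in the end, this is just something Python does. Unfortunately, there isn't really a good way to detect this behavior, which leads to errors such as the one shown. Usually, when an identifier is only ever referred to as an identifier, i.e. used like a real variable or attribute, everything is fine: the normalization runs every time, and the identifier is found.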

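The only problem is with string-based access. Strings are just strings, so of course no normalization happens (that would just be a bad idea). The two ways shown here, getattr and locals, both operate on dictionaries: getattr() looks up an object's attribute by name, typically through the object's __dict__, and locals() returns a dictionary. In dictionaries, keys can be any string, so it is perfectly fine to have a MICRO SIGN or an OHM SIGN in there.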
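To illustrate the difference between identifier-based and string-based access, here is a sketch in which both spellings end up as separate keys in the same class dictionary. The exec call is only used so that the source text containing the micro sign is explicit; the names and values are made up for this example.

```python
MICRO_SIGN = "\u00b5"
GREEK_SMALL_MU = "\u03bc"

class Test:
    pass

# String-based access: the name is stored verbatim, no normalization happens.
setattr(Test, MICRO_SIGN, "set via string")

# Identifier-based access: the compiler normalizes the micro sign in the
# source text to NFKC, so the attribute is stored under GREEK SMALL LETTER MU.
exec(f"Test.{MICRO_SIGN} = 'set via identifier'", {"Test": Test})

print(vars(Test)[MICRO_SIGN])       # set via string
print(vars(Test)[GREEK_SMALL_MU])   # set via identifier
```

The class now carries two distinct attributes that look identical in most fonts, which is why mixing the two access styles can be confusing.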

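In those cases, you need to remember to perform the normalization yourself. We can use unicodedata.normalize for this, which then also allows us to correctly get our value from inside locals() (or using getattr):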

>>> import unicodedata
>>> normalized_ohm = unicodedata.normalize('NFKC', 'Ω')
>>> locals()[normalized_ohm]
'bar'
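The same idea can be wrapped in a small helper for getattr. Note that getattr_nfkc is a hypothetical name introduced here for illustration, not a standard library function, and the Test class is set up by hand to mirror what the compiler would store.

```python
import unicodedata

def getattr_nfkc(obj, name, *default):
    """Hypothetical helper: normalize the name to NFKC, as the parser does
    for identifiers, and then delegate to the built-in getattr."""
    return getattr(obj, unicodedata.normalize("NFKC", name), *default)

class Test:
    pass

# Store the attribute under GREEK SMALL LETTER MU, which is the name the
# compiler would use for a class body containing `µ = 'foo'`.
setattr(Test, "\u03bc", "foo")

print(getattr_nfkc(Test, "\u00b5"))          # foo (looked up via MICRO SIGN)
print(getattr_nfkc(Test, "missing", None))   # None
```

Normalizing at the call site keeps the strings themselves untouched, so nothing else that relies on the exact code points is affected.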

This page is based on the Stack Overflow question "Why is the micro sign converted into the Greek letter mu?": https://stackoverflow.com/questions/34097193/
