python - Supplementary code points in Python unicode strings

unichr(0x10000) fails with a ValueError when CPython was compiled without --enable-unicode=ucs4.

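Is there a language built-in or core library function that converts an arbitrary unicode scalar value or code point to a unicode string that works regardless of which Python interpreter the program is running on?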

Best answer

Yes, here you go:

>>> unichr(0xd800)+unichr(0xdc00)
u'\U00010000'

The key point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, on unichr() reads,

Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.

Note the phrase "one character": in Unicode terms, they mean "one code unit".

I assume you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, The Python Standard Library documentation for 3.3.0, 2. Built-in Functions, on chr() reads,

Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).

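Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) behaves as you expect. It "converts an arbitrary unicode scalar value or code point to a unicode string that works regardless of which Python interpreter the program is running on".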
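As a small illustration of the Python 3.x behaviour, here is a sketch that assumes Python 3.3 or later, where a string is indexed by code point. The variable names are only for illustration. It builds the string with chr() and then encodes it to UTF-16 to show the two code units that a UCS2 build of Python 2.x would have stored.

```python
# Assumes Python 3.3+; on that version a str is a sequence of code points.
ch = chr(0x10000)

# One code point, so the string has length 1.
code_point_count = len(ch)

# Encoding to big-endian UTF-16 exposes the surrogate pair D800 DC00.
utf16_bytes = ch.encode('utf-16-be')
code_unit_count = len(utf16_bytes) // 2

print(code_point_count, code_unit_count)
```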

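But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you use Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.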

You can isolate this awareness in a function that tries unichr() on the scalar value, catches ValueError, and then tries again with the corresponding UTF-16 surrogate pair:

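def unichr_supplemental(scalar):
    try:
        return unichr(scalar)
    except ValueError:
        # Only supplementary code points can be rescued with a surrogate pair.
        if not 0x10000 <= scalar <= 0x10FFFF:
            raise
        offset = scalar - 0x10000
        # High (lead) surrogate first, then low (trail) surrogate.
        return unichr(0xd800 + offset // 0x400) + unichr(0xdc00 + offset % 0x400)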

>>> unichr_supplemental(0x41),len(unichr_supplemental(0x41))
(u'A', 1)
>>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
(u'\U00010000', 2)
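The surrogate arithmetic in the fallback branch can be checked on plain integers, independently of how the interpreter stores strings. The following is a minimal sketch; the helper names to_surrogate_pair and from_surrogate_pair are hypothetical and not part of any library. It runs unchanged on Python 2.x and 3.x.

```python
def to_surrogate_pair(scalar):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % scalar)
    offset = scalar - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(unit) for unit in to_surrogate_pair(0x10000)])
```

Shifting by 10 bits and masking with 0x3FF is the same computation as the // 0x400 and % 0x400 used in unichr_supplemental().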

But you might find it easier to convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode that byte string into a unicode string:

>>> '\x00\x00\x00\x41'.decode('utf-32be'), \
... len('\x00\x00\x00\x41'.decode('utf-32be'))
(u'A', 1)
>>> '\x00\x01\x00\x00'.decode('utf-32be'), \
... len('\x00\x01\x00\x00'.decode('utf-32be'))
(u'\U00010000', 2)
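Rather than writing the four bytes by hand, they can be produced from the integer with struct.pack. This is a sketch only; the function name unichr_utf32 is hypothetical. It uses the standard '>I' format (unsigned 32-bit, big-endian) to match the 'utf-32-be' codec, and the same code should run on Python 2.x and 3.x.

```python
import struct


def unichr_utf32(scalar):
    """Build a unicode string for one code point by decoding its UTF-32 form."""
    # '>I' packs the integer as 4 bytes, most significant byte first.
    utf32_bytes = struct.pack('>I', scalar)
    return utf32_bytes.decode('utf-32-be')


print(repr(unichr_utf32(0x41)))
print(repr(unichr_utf32(0x10000)))
```

The length of the result for 0x10000 still depends on the interpreter: 2 on a UCS2 build of Python 2.x, as in the session above, and 1 where strings hold full code points.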

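The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I did not test it on a Python 2.x interpreter that uses UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.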
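To find out which kind of interpreter a program is running on, sys.maxunicode can be inspected: it is 0xFFFF on a UCS2 (UTF-16) build and 0x10FFFF on a UCS4 (UTF-32) build. The helper name is_narrow_build below is hypothetical, and the snippet is only a sketch of that check.

```python
import sys


def is_narrow_build():
    """Return True if this interpreter stores unicode as 16-bit code units."""
    # sys.maxunicode is the largest code point a single string element can hold.
    return sys.maxunicode == 0xFFFF


print(hex(sys.maxunicode), is_narrow_build())
```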

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/9284199/
