python - How are characters encoded in Python?

From Dive into Python:

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.

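I don't understand what the author means.

When I say s = 'hello', how is s encoded internally? Of course it must use some encoding. He says all strings are sequences of Unicode characters. But how many bytes is each character? Is this string UTF-8? Why does he say: "There is no such thing as a Python string encoded in UTF-8".

I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.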

Best answer

When I say s = 'hello', how is s encoded internally? Of course it must use some encoding.

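It depends. Frankly, it doesn't matter. CPython now uses the Flexible String Representation (PEP 393), a wonderful space and time optimization. But you shouldn't care, because the internal layout never changes what the string means.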

He says all strings are sequences of Unicode characters. But how many bytes is each character?

Don't know. It depends. For that particular case it is probably Latin-1 (1 byte per character) when using CPython.
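As a rough illustration of that "it depends", the sketch below measures how much memory one extra character costs for a few different characters. It assumes CPython 3.3 or later, where the Flexible String Representation applies; bytes_per_char is a made-up helper name for this example, and other Python implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by growing a string by one character.

    Hypothetical helper for illustration only; it relies on CPython's
    sys.getsizeof, which is an implementation detail.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# An ASCII letter, a Latin-1 letter, a BMP symbol and an emoji.
for ch in ("h", "\xf1", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython the per-character cost should come out as 1, 1, 2 and 4 bytes respectively: the interpreter picks the narrowest fixed width that can hold the largest code point in the string. None of this is visible from Python code that simply uses the string.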

Is this string UTF-8?

No.

Why does he say: "There is no such thing as a Python string encoded in UTF-8".

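Because it is a sequence of Unicode code points. If you confuse encodings with strings (as other languages often force you to), you might think that 'Jalape\xc3\xb1o' is 'Jalapeño', because in UTF-8 the byte sequence '\xc3\xb1' represents 'ñ'. But it isn't, because a string has no inherent encoding, just as the number 100 is the number 100, and not 4, whether you write it in binary, decimal or unary.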
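The Jalapeño example can be checked directly. This is a small sketch using only the built-in str.encode and bytes.decode; the variable names are chosen for illustration.

```python
text = "Jalapeño"                    # a str: eight Unicode code points
utf8 = text.encode("utf-8")          # bytes: b'Jalape\xc3\xb1o'
latin1 = text.encode("latin-1")      # bytes: b'Jalape\xf1o'

print(len(text), len(utf8), len(latin1))   # 8 9 8

# Decoding with the matching encoding recovers the same str.
assert utf8.decode("utf-8") == text
assert latin1.decode("latin-1") == text

# A str that contains the code points U+00C3 and U+00B1 is a different,
# nine-character string. It is what you get by decoding the UTF-8 bytes
# with the wrong encoding.
mojibake = "Jalape\xc3\xb1o"
print(mojibake == text)              # False
assert utf8.decode("latin-1") == mojibake
```

The same str produces different byte sequences under different encodings, which is why asking which encoding the str itself "is in" has no answer.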

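He says that because people come from languages where there are only bytes that stand in for strings, so they think "but how is this encoded?", as if they had to decode it themselves. That would be like carrying around lists of 1s and 0s instead of being able to use numbers, and having to tell every function which byte order you are using.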
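The number analogy can be made concrete with Python's own int. This is only a sketch of the analogy, not something from the original answer; it uses the built-in int, int.to_bytes and int.from_bytes.

```python
n = 100

# The same integer written in different notations is still the same value.
assert int("1100100", 2) == int("100", 10) == int("64", 16) == n

# The digits "100" read as binary are a different number entirely.
print(int("100", 2))                 # 4

# Serialising to bytes forces a choice of width and byte order...
big = n.to_bytes(2, "big")           # b'\x00d'
little = n.to_bytes(2, "little")     # b'd\x00'
print(big, little)

# ...and reading the bytes back requires knowing which choice was made.
assert int.from_bytes(big, "big") == n
assert int.from_bytes(big, "little") == 25600
```

An int has no byte order until it is serialised, in the same way that a str has no encoding until it is encoded to bytes.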

I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.

Hopefully it doesn't any more :).

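If this still confuses you, I recommend this question, partly because someone called my answer there "very comprehensive"¹, but also because Steven D'Aprano posted one of his excellent Python mailing list replies there. He and I both answered on the list and then posted our text.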

If you're wondering why it's relevant, I'll quote:

So the person you are quoting is causing confusion when he talks about an "encoded string", he should either make it clear he means a string of bytes, or not mention the word string at all.

Isn't that exactly your confusion?

¹ Technically he called a different answer "another very comprehensive answer", but that implies what I just said ;).

Regarding python - How are characters encoded in Python?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18913462/
