go - Can a character span multiple runes in Go?

Tags: go character-encoding

I read the following on this blog:

Even with rune slices a single character might span multiple runes, which can happen if you have characters with grave accent, for example. This complicated and ambiguous nature of "characters" is the reason why Go strings are represented as byte sequences.

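Is this true? (It looks like the blog of someone who knows Go.) I tested on my machine, and "è" is 1 rune and 2 bytes. And the Go doc seems to say otherwise.

Have you ever come across such a character (in UTF-8)? Can a character span multiple runes in Go?

Best answer

Yes, it can: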

s := "é́́"
fmt.Println(s, []rune(s))

Output (try it on the Go Playground):

é́́ [101 769 769 769]

One character, 4 runes. It can be arbitrarily long...
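To make the byte and rune counts explicit, here is a minimal sketch using only the standard library. The helper name counts is made up for illustration, and the string is written with \u escapes so the code points are unambiguous regardless of how an editor renders the accents.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the byte length
// and the number of runes (code points) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// 'e' followed by three combining acute accents (U+0301).
	s := "e\u0301\u0301\u0301"
	b, r := counts(s)
	fmt.Println(b, r, []rune(s)) // 7 4 [101 769 769 769]
}
```

Each U+0301 takes 2 bytes in UTF-8, so the single visible character occupies 7 bytes and 4 runes.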

The example is taken from The Go Blog: Text Normalization in Go.

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at at time.
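This also explains the observation in the question that an accented letter can be a single rune: many accented letters also exist as one precomposed code point. The sketch below compares the precomposed form with the decomposed (NFD) form from the quote. The helper name codePoints is made up for illustration. Converting between the two forms is what the golang.org/x/text/unicode/norm package covered in that blog post is for; it is not used here so the example stays standard-library only.

```go
package main

import "fmt"

// codePoints is a hypothetical helper that returns the runes of s.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	composed := "\u00e9"    // precomposed 'é' (U+00E9): one rune
	decomposed := "e\u0301" // 'e' + combining acute: two runes

	fmt.Println(codePoints(composed))   // [233]
	fmt.Println(codePoints(decomposed)) // [101 769]

	// The two strings render the same but are different byte
	// sequences, so a plain comparison reports them as unequal.
	fmt.Println(composed == decomposed) // false
}
```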

A character can be followed by any number of modifiers (modifiers can be repeated or stacked):

Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a perfectly valid 4-rune character according to the standard.
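As a rough illustration of counting characters in that sense, the sketch below treats every rune that is not a Unicode mark as the start of a new character and attaches any following marks to it. This is an approximation chosen for brevity: the function name countCharacters is made up, and using unicode.IsMark stands in for the real starter/non-starter distinction, which is based on canonical combining classes. It does not handle other multi-rune sequences such as emoji joined with zero-width joiners.

```go
package main

import (
	"fmt"
	"unicode"
)

// countCharacters is a hypothetical helper. It counts a rune as the
// start of a new character unless it is a mark (category M) that
// follows an earlier rune. This only approximates the definition
// quoted above.
func countCharacters(s string) int {
	n := 0
	for i, r := range s {
		// i is the byte offset, so i == 0 identifies the first rune;
		// a leading mark has nothing to attach to and counts alone.
		if i == 0 || !unicode.IsMark(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countCharacters("e\u0301\u0301\u0301")) // 1
	fmt.Println(countCharacters("cafe\u0301"))          // 4
}
```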

See also: Combining character.

Edit: "Doesn't this kill the 'concept of runes'?"

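Answer: It is not a concept of runes. Runes are not characters. A rune is an integer value that identifies a Unicode code point. A character may be a single Unicode code point, in which case 1 character is 1 rune. Most general uses of runes fit this case, so in practice this rarely causes any headaches. It is a concept of the Unicode standard.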
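A small sketch of what "a rune is an integer value" means in practice: rune is an alias for int32, and ranging over a string steps through it one code point at a time, not one character at a time. The helper name runeOffsets is made up for illustration.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that returns the byte offset
// at which each rune of s begins.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	var r rune = 'e' // rune is an alias for int32
	fmt.Println(r)   // 101: the code point value, not a glyph

	// "é" (decomposed) followed by "x": two visible characters,
	// three runes, starting at byte offsets 0, 1 and 3.
	fmt.Println(runeOffsets("e\u0301x")) // [0 1 3]
}
```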

Regarding "go - Can a character span multiple runes in Go?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36569018/
