string - The [] operator on strings, and its link with vector slicing

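Why do you have to walk over the string to find the nᵗʰ letter of a string when you do s[n], where s is a string? (According to https://doc.rust-lang.org/book/strings.html )

As far as I understand, a string is an array of chars, and a char is an array of 4 bytes or a 4-byte number. So wouldn't getting the nth letter be similar to doing v[4*n..4*n+4], where v is a vector?

What is the cost of v[i..j]?

I would assume that the cost of v[i..j] is j-i, and therefore that the cost of s[n] should be 4.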

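Best answer

Note: the second edition of The Rust Programming Language has an improved and smoother explanation of strings in Rust, which you may want to read as well. The answer below, although still accurate, quotes from the first edition of the book.

I will try to clarify these misconceptions about strings in Rust by quoting from the book (https://doc.rust-lang.org/book/strings.html).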

A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences.

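With this in mind, and given that UTF-8 code points vary in size (1 to 4 bytes depending on the character), no string in Rust, whether &str or String, is an array of characters, and none can be treated as one. The section on Slicing explains why in more detail: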

Because strings are valid UTF-8, they do not support indexing:
let s = "hello";

println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading.
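To make the variable width concrete, here is a minimal sketch using only the standard library. The helper names (char_size_in_memory, byte_and_char_counts, utf8_widths) are made up for illustration. It shows that a char held on its own takes 4 bytes, while the same scalar value stored inside a string takes between 1 and 4 bytes.

```rust
use std::mem::size_of;

/// Size in bytes of one `char` value held in memory (for example in a `Vec<char>`).
fn char_size_in_memory() -> usize {
    size_of::<char>()
}

/// Returns (length in bytes, number of Unicode scalar values) for `s`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Returns the UTF-8 encoded width, in bytes, of each `char` in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

Because the widths differ from one char to the next, the byte offset of the nth char cannot be computed as 4*n. It has to be found by walking the string from the start.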

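Contrary to what the question suggests, one cannot do s[n]: although in theory this would let us fetch the nth byte in constant time, that byte is not guaranteed to make any sense by itself.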
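If the nth byte really is what you want, it can be fetched explicitly through as_bytes(). The sketch below uses hypothetical helper names (nth_byte, is_complete_utf8) to show the constant-time lookup and why a lone byte may be meaningless.

```rust
/// Returns the byte at byte offset `n` in constant time, or `None` if `n` is out of range.
fn nth_byte(s: &str, n: usize) -> Option<u8> {
    s.as_bytes().get(n).copied()
}

/// Reports whether a single byte is, on its own, a complete UTF-8 sequence.
fn is_complete_utf8(byte: u8) -> bool {
    std::str::from_utf8(&[byte]).is_ok()
}
```

For an ASCII string such as "hello", every byte is a complete character. For "忠犬ハチ公", the byte at offset 0 is only the first of the three bytes that encode 忠, so is_complete_utf8 returns false for it.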

What is the cost of v[i..j]?

The cost of slicing is actually constant, because it is done at the byte level:

You can get a slice of a string with slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:

thread '' panicked at 'index 0 and/or 2 in 忠犬ハチ公 do not lie on character boundary'

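Basically, slicing is acceptable and yields a new view into that string, so no copy is made. However, it should only be used when you are completely sure that the offsets fall on character boundaries.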
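When the offsets are not known to be valid, one possible approach is to let the standard library check them. This is a sketch, with illustrative helper names (checked_slice, first_n_chars): str::get returns None instead of panicking, and char_indices yields byte offsets that always lie on a character boundary.

```rust
/// Slices `s` by byte offsets, returning `None` instead of panicking when an
/// offset is out of range or does not lie on a char boundary.
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Returns the prefix of `s` holding its first `n` chars (all of `s` if it is shorter).
fn first_n_chars(s: &str, n: usize) -> &str {
    // `char_indices` pairs each char with the byte offset where it starts,
    // so the offset of char number `n` is a safe place to cut.
    match s.char_indices().nth(n) {
        Some((byte_offset, _)) => &s[..byte_offset],
        None => s,
    }
}
```

checked_slice itself is constant time and returns a view into the original buffer. first_n_chars is linear in n, because it must walk the string to find the boundary.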

To iterate over each character of the string, you can call chars() instead:

let c = s.chars().nth(n);
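As a small sketch of what that line does, the helpers below (names chosen for illustration) wrap the same call and also report where the nth char starts in bytes. Both walk the string from the beginning, so their cost grows with n rather than being constant.

```rust
/// Returns the `n`th Unicode scalar value of `s` (zero-based), or `None` if `s` is shorter.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Returns the byte offset at which the `n`th char of `s` starts, if there is one.
fn nth_char_byte_offset(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(byte_offset, _)| byte_offset)
}
```

For "忠犬ハチ公", nth_char(s, 2) yields 'ハ' and nth_char_byte_offset(s, 2) yields 6, not 8 as the 4*n guess in the question would suggest.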

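Even with this in mind, note that handling Unicode scalar values may not be exactly what you want if you wish to deal with character modifiers in UTF-8 (which are scalar values by themselves, but should not be treated individually either). Quoting now from the str API: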

fn chars(&self) -> Chars

Returns an iterator over the chars of a string slice.

As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.

Remember, chars may not match your human intuition about characters:
let y = "y̆";

let mut chars = y.chars();

assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());

The unicode_segmentation crate provides a way to determine grapheme cluster boundaries:

extern crate unicode_segmentation;

use unicode_segmentation::UnicodeSegmentation;

let s = "a̐éö̲\r\n";
let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
assert_eq!(g, b);

Regarding string - The [] operator on strings, and its link with vector slicing, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38536207/
