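Why do you have to walk over the string to find the nᵗʰ letter when you write s[n], where s is a string? (According to https://doc.rust-lang.org/book/strings.html )
As I understand it, a string is an array of chars, and a char is an array of 4 bytes or a 4-byte number. So wouldn't getting the nth letter be similar to doing v[4*n..4*n+4], where v is a vector?
What is the cost of v[i..j]?
I assume the cost of v[i..j] is j-i, so the cost of s[n] should be 4.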
Best answer
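Note: the second edition of The Rust Programming Language has an improved and smoother explanation of strings in Rust, which you may want to read as well. The answer below, while still accurate, quotes the first edition of the book.
I'll try to clear up these misconceptions about strings in Rust by quoting the book (https://doc.rust-lang.org/book/strings.html).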
A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences.
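With this in mind, and given that UTF-8 code points vary in size (1 to 4 bytes depending on the character), all strings in Rust, whether they are &str or String, are not arrays of characters and cannot be treated as such. The section on Slicing explains further why: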
Because strings are valid UTF-8, they do not support indexing:
let s = "hello";
println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading.
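Contrary to what the question suggests, s[n] is not allowed: although in theory it would let us fetch the nth byte in constant time, that byte is not guaranteed to make any sense on its own.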
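To make the byte/character distinction concrete, here is a minimal sketch. The helper name `byte_and_char_lengths` is invented for this illustration and is not part of the standard library. It shows that the byte view of a string can be indexed in constant time, but that a single byte may be only a fragment of a character.

```rust
/// Returns (number of UTF-8 bytes, number of Unicode scalar values).
/// Hypothetical helper, named here only for illustration.
fn byte_and_char_lengths(s: &str) -> (usize, usize) {
    // `len` is O(1); `chars().count()` has to decode the whole string.
    (s.len(), s.chars().count())
}

fn main() {
    let ascii = "hello";
    let dog = "忠犬ハチ公";

    // ASCII text: one byte per character.
    assert_eq!(byte_and_char_lengths(ascii), (5, 5));
    // Each of these five characters takes 3 bytes in UTF-8.
    assert_eq!(byte_and_char_lengths(dog), (15, 5));

    // Indexing the underlying bytes is O(1) ...
    let first_byte = dog.as_bytes()[0];
    // ... but this is only the lead byte of '忠' (encoded as E5 BF A0).
    assert_eq!(first_byte, 0xE5);
    println!("first byte: {:#04x}", first_byte);
}
```

If raw bytes are really what you need, `as_bytes()` makes that choice explicit, which is the kind of clarity the book is arguing for.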
What is the cost of v[i..j]?
Slicing actually takes constant time, because it is done at the byte level:
You can get a slice of a string with slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:
thread '<main>' panicked at 'index 0 and/or 2 in `忠犬ハチ公` do not lie on character boundary'
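Basically, slicing is acceptable and yields a new view into the string, so no copy is made. However, you should only use it when you are completely sure that the offsets lie on character boundaries.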
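When the offsets are not known to be valid, one option is to check them or compute them from the characters themselves. The sketch below uses the standard `is_char_boundary`, `get`, and `char_indices` methods on `str`; the helper `first_n_chars` is a hypothetical name introduced only for this example.

```rust
/// Returns the first `n` characters of `s` as a slice, without copying.
/// Hypothetical helper: it walks the string once to find the byte offset.
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `idx` is the byte offset at which the char with index `n` starts.
        Some((idx, _)) => &s[..idx],
        // The string has `n` chars or fewer: return all of it.
        None => s,
    }
}

fn main() {
    let dog = "忠犬ハチ公";

    // Byte offset 2 falls inside '忠', so it is not a char boundary.
    assert!(!dog.is_char_boundary(2));
    assert!(dog.is_char_boundary(3));

    // `get` returns None instead of panicking on an invalid range.
    assert_eq!(dog.get(0..2), None);
    assert_eq!(dog.get(0..3), Some("忠"));

    // Slicing by character count costs a walk, but is always valid.
    assert_eq!(first_n_chars(dog, 2), "忠犬");
    assert_eq!(first_n_chars("hachiko", 5), "hachi");
}
```

The slice itself is still constant time; the linear cost is in finding the right byte offset.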
To iterate over each character of a string, you can call chars() instead:
let c = s.chars().nth(n);
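As a rough illustration of the costs involved, the sketch below wraps that call in a hypothetical helper `nth_char` and compares it with collecting the string into a `Vec<char>`. The vector is the "array of 4-byte values" the question had in mind: it gives constant-time indexing, at the price of a one-time linear conversion and more memory.

```rust
/// Returns the char at index `n`, or None if the string is too short.
/// Hypothetical helper; it is O(n) because it decodes chars one by one.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let s = "忠犬ハチ公";

    assert_eq!(nth_char(s, 2), Some('ハ'));
    // An out-of-range index gives None rather than a panic.
    assert_eq!(nth_char(s, 10), None);

    // One-time O(len) conversion into fixed-width chars.
    let v: Vec<char> = s.chars().collect();
    // Indexing is now O(1), like any other vector.
    assert_eq!(v[2], 'ハ');

    // A char is always 4 bytes, so the vector's elements take 20 bytes,
    // compared with 15 bytes for the UTF-8 encoding of the same text.
    assert_eq!(std::mem::size_of::<char>(), 4);
    assert_eq!(v.len() * std::mem::size_of::<char>(), 20);
    assert_eq!(s.len(), 15);
}
```

Collecting into a vector is only worth it when many random accesses follow; for a single lookup, `chars().nth(n)` avoids the allocation.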
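Even with this in mind, note that working with Unicode chars may not be what you want if you need to handle character modifiers in UTF-8 (combining marks, which are scalar values themselves but should not be treated in isolation). Quoting now from the str API: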
fn chars(&self) -> Chars
Returns an iterator over the chars of a string slice.
As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
Remember, chars may not match your human intuition about characters:
let y = "y̆";
let mut chars = y.chars();
assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());
The unicode_segmentation crate provides a way to find grapheme cluster boundaries:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
let s = "a̐éö̲\r\n";
let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
assert_eq!(g, b);
This question, about the [] string operator and its link to vector slicing, comes from Stack Overflow: https://stackoverflow.com/questions/38536207/