I am trying to read a file one line at a time in Rust, and started out as suggested in this question:
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        match line {
            Ok(line) => println!("Ok: {}", line),
            Err(error) => println!("Err: {}", error),
        }
    }
    Ok(())
}
However, my file is not UTF-8. The Python chardet.universaldetector library tells me it is ISO-8859-1:
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire
Out of the box, Rust cannot decode the lines that contain non-UTF-8 bytes:
$ ./target/release/main1
Ok: Cuba
Err: stream did not contain valid UTF-8
Ok: Cyprus
Ok: Czech Republic
Err: stream did not contain valid UTF-8
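The failure above can be reproduced without a file. The following is a minimal sketch, assuming a short in-memory byte string stands in for countries.txt; the helper name `classify_lines` is hypothetical and not part of the original question. It shows that `lines()` reports `ErrorKind::InvalidData` for the line holding the ISO-8859-1 byte 0xE7 ('ç') and then carries on with the next line.

```rust
use std::io::{BufRead, BufReader, Cursor, ErrorKind};

/// Hypothetical helper: reads `bytes` line by line with `BufRead::lines`
/// and records, per line, either the decoded text or the error kind.
fn classify_lines(bytes: &[u8]) -> Vec<Result<String, ErrorKind>> {
    BufReader::new(Cursor::new(bytes))
        .lines()
        .map(|line| line.map_err(|e| e.kind()))
        .collect()
}

fn main() {
    // "Curaçao" in ISO-8859-1: 'ç' is the single byte 0xE7. In UTF-8, 0xE7
    // must be followed by continuation bytes, so this line is rejected.
    let latin1 = b"Cuba\nCura\xE7ao\nCyprus\n";
    for line in classify_lines(latin1) {
        match line {
            Ok(text) => println!("Ok: {}", text),
            Err(kind) => println!("Err: {:?}", kind),
        }
    }
}
```

The error applies per line: the pure-ASCII lines still decode, which matches the mixed Ok/Err output shown above.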
So I tried the encoding_rs_io library. I am using Windows-1252 here rather than ISO-8859-1, but it seems to handle this data:
use std::error::Error;
use std::fs::File;
use std::io::Read;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let mut reader = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);
    let mut buffer = vec![];
    reader.read_to_end(&mut buffer)?;
    println!("{}", String::from_utf8(buffer).unwrap());
    Ok(())
}
This decodes the non-ASCII characters correctly:
$ ./target/release/main2
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire
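However, it has no lines() method, so I cannot read one line at a time. I noticed that the ripgrep project uses this library to decode non-UTF-8 files, and I have stepped into its source code in a debugger. As far as I can tell, it does its own hand-rolled CR/LF detection. Surely the task of reading a non-UTF-8 file one line at a time in Rust has already been solved. Do I really need to reinvent this wheel? Any help would be much appreciated!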
Best answer
DecodeReaderBytes implements io::Read, so you should be able to wrap it in a std::io::BufReader and use its lines method:
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    // Transcode Windows-1252 to UTF-8 first, then buffer the decoded stream
    // so that lines() only ever sees valid UTF-8.
    let reader = BufReader::new(
        DecodeReaderBytesBuilder::new()
            .encoding(Some(WINDOWS_1252))
            .build(file),
    );
    for line in reader.lines() {
        // Each item is an io::Result<String>; propagate read errors with `?`.
        println!("{}", line?);
    }
    Ok(())
}
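The answer above relies on encoding_rs_io to do the transcoding. If the input is strictly ISO-8859-1, as chardet reported, a dependency-free alternative is possible, because every ISO-8859-1 byte maps to the Unicode code point with the same value. The following is a rough sketch of that different technique, not the answer's method: the helper names `latin1_to_string` and `latin1_lines` are hypothetical, and an in-memory buffer stands in for countries.txt. It would not be correct for Windows-1252 input that uses bytes in the 0x80 to 0x9F range, where the two encodings differ.

```rust
use std::io::{self, BufRead, BufReader, Cursor};

/// Hypothetical helper: decodes ISO-8859-1 bytes. Each byte value equals
/// its Unicode code point (U+0000 to U+00FF), so `u8 as char` is enough.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: yields decoded lines from any `BufRead`, splitting
/// on raw b'\n' bytes before decoding so invalid UTF-8 never reaches `String`.
fn latin1_lines<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<String>> {
    reader.split(b'\n').map(|chunk| {
        chunk.map(|mut bytes| {
            // Drop the '\r' of a CRLF line ending, as `lines()` would.
            if bytes.last() == Some(&b'\r') {
                bytes.pop();
            }
            latin1_to_string(&bytes)
        })
    })
}

fn main() -> io::Result<()> {
    // In-memory stand-in for File::open("countries.txt").
    let data = Cursor::new(&b"Cuba\nCura\xE7ao\r\nC\xF4te d'Ivoire\n"[..]);
    for line in latin1_lines(BufReader::new(data)) {
        println!("{}", line?);
    }
    Ok(())
}
```

For any encoding other than ISO-8859-1, wrapping the decoder in a BufReader as in the answer remains the more general choice.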
Regarding utf-8 - how to read a non-UTF-8 file line by line in Rust, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/64040851/