parsing - Fast parsing of a large utf-8 text file in Haskell

I have a 300MB file (link) that contains utf-8 characters. I want to write a Haskell program equivalent to:

cat bigfile.txt | grep "^en " | wc -l

This pipeline runs in 2.6 seconds on my system.

Right now I'm reading the file as a plain String (readFile), and I only count the lines so far:

main = do
    contents <- readFile "bigfile.txt"
    putStrLn $ show $ length $ lines contents

After a few seconds, I get this error:

Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence)

I guess I need to use something more utf-8 friendly? How can I make this both fast and utf-8 compatible? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it doesn't support utf-8.

Best answer

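The utf8-string package provides support for reading and writing UTF8 strings. It reuses the ByteString infrastructure, so the interface is likely to be very similar.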
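As a rough illustration of the ByteString-based approach, here is a minimal sketch that counts the lines starting with "en ". It uses only the bytestring package rather than utf8-string itself. It relies on one assumption: the prefix and the newline are plain ASCII, and UTF-8 never reuses ASCII byte values inside a multi-byte sequence, so the match can be done on raw bytes without decoding. The name countEnLines is a hypothetical helper, not part of any library.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical helper: count the lines that begin with the ASCII prefix "en ".
-- The comparison is byte-wise, so no UTF-8 decoding takes place.
countEnLines :: BL.ByteString -> Int
countEnLines = length . filter (prefix `BL.isPrefixOf`) . BL.lines
  where
    prefix = BL.pack "en "

main :: IO ()
main = do
  -- Lazy readFile streams the file in chunks instead of loading all 300MB at once.
  contents <- BL.readFile "bigfile.txt"
  print (countEnLines contents)
```

If the matched lines later have to be handled as real Unicode text, they could be decoded afterwards, for example with toString from utf8-string's Data.ByteString.Lazy.UTF8 module.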

Another Unicode string project is discussed in this Masters thesis; it is probably related to the above and was also inspired by ByteStrings.
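For comparison, the String-based program from the question can probably be kept working as well. The sketch below assumes the "Illegal byte sequence" error occurs because the file handle defaults to a locale encoding other than UTF-8, which may not be the actual cause on every system. It sets the encoding explicitly with hSetEncoding from System.IO. It is likely to be slower than a ByteString version, and countEnLines is again a hypothetical helper name.

```haskell
import Data.List (isPrefixOf)
import System.IO

-- Hypothetical helper: count the lines that begin with "en ".
countEnLines :: String -> Int
countEnLines = length . filter ("en " `isPrefixOf`) . lines

main :: IO ()
main = do
  h <- openFile "bigfile.txt" ReadMode
  -- Assumption: the default locale encoding is not UTF-8, so decode as UTF-8 explicitly.
  hSetEncoding h utf8
  contents <- hGetContents h
  -- The count forces the lazily read contents before the handle is closed.
  print (countEnLines contents)
  hClose h
```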

This question, parsing - Fast parsing of a large utf-8 text file in Haskell, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/8172889/
