performance - 使用 Data.ByteString 实现 unix 的 "cat"程序的 Haskell 性能

我有以下 Haskell 代码，实现了“cat”unix 命令行实用程序的简单版本。在 400MB 文件上使用“时间”测试性能，速度大约慢 3 倍。 (我用来测试它的确切脚本位于代码下方)。

我的问题是:

这是对性能的有效测试吗？

我怎样才能让这个程序运行得更快？

一般来说，我如何识别 Haskell 程序中的性能瓶颈？

关于问题 2 和 3:我使用了 GHC -prof，然后使用 +RTS -p 运行，但我发现这里的输出有点缺乏信息。

来源(Main.hs)

module Main where

import System.IO
import System.Environment
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle buf = do
  hPut stdout buf
  eof <- hIsEOF handle
  unless eof $ do
    buf <- hGetSome handle bufsize
    go handle buf

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- hGetSome handle bufsize
  hSetBuffering stdin $ BlockBuffering (Just bufsize)
  hSetBuffering stdout $ BlockBuffering (Just bufsize)
  go handle buf

计时脚本(run.sh):

#!/usr/bin/env bash

# Generate 10M lines of silly test data
yes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | head -n 10000000 > huge

# Compile with optimisation
ghc -O2 Main.hs

# Run haskell
echo "timing Haskell"
time ./Main huge > /dev/null

echo ""
echo ""

# Run cat
echo "timing 'cat'"
time cat huge > /dev/null

我的结果:

timing Haskell

real    0m0.980s
user    0m0.296s
sys     0m0.684s


timing 'cat'

real    0m0.304s
user    0m0.001s
sys     0m0.302s

使用 -prof 编译并使用 +RTS -p 运行时的分析报告如下:

  Sat Dec 13 21:26 2014 Time and Allocation Profiling Report  (Final)

     Main +RTS -p -RTS huge

  total time  =        0.92 secs   (922 ticks @ 1000 us, 1 processor)
  total alloc = 7,258,596,176 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.Conc.Signal           84           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          81           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    76           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           70           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     69           0    0.0    0.0     0.0    0.0

最佳答案

这只是试图解决第二个问题的部分答案:

我使用 GHC.IO.Buffer 尝试了类似的操作接口(interface):

module Main where

import System.IO
import System.Environment
import GHC.IO.Buffer
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle bufPtr = do
  read <- hGetBuf handle bufPtr bufsize
  when (read > 0) $ do
    hPutBuf stdout bufPtr read
    go handle bufPtr

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- newByteBuffer bufsize WriteBuffer

  withBuffer buf $ go handle

它似乎更接近“猫”的表现，但仍然肯定更慢......

time ./Cat huge > /dev/null 
./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

time cat huge > /dev/null  
cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total

我认为使用缓冲区 API，我们可以在使用 hGetSome 时显式避免分配所有缓冲区字节串。就像在原始代码中一样，但我只是在这里猜测并且不知道两个编译代码中到底发生了什么......

更新:在我的笔记本电脑上添加原始代码的性能:

time ./Cat2 huge > /dev/null
./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total

更新 2:添加一些基本的分析结果:

原始代码:

Cat2 +RTS -p -RTS huge

    total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)
    total alloc = 6,954,068,112 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0


                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.IO.Handle.FD          86           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal           82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           80           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 79           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    75           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     72           0    0.0    0.0     0.0    0.0

缓冲区 API 代码:

Cat +RTS -p -RTS huge

    total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)
    total alloc =   3,487,712 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0   98.9


                                                      individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                     44           0  100.0   98.9   100.0  100.0
 CAF        GHC.IO.Handle.FD         85           0    0.0    1.0     0.0    1.0
 CAF        GHC.Conc.Signal          82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding          80           0    0.0    0.1     0.0    0.1
 CAF        GHC.IO.FD                79           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv    71           0    0.0    0.0     0.0    0.0

特别注意分配成本的巨大差异......

关于performance - 使用 Data.ByteString 实现 unix 的 "cat"程序的 Haskell 性能，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/27463669/

performance - 使用 Data.ByteString 实现 unix 的 "cat"程序的 Haskell 性能

上一篇：javascript - 同一窗口事件的多个监听器如何影响性能？

下一篇：validation - 根据复选框值有条件地进行输入