performance - 最长公共(public)子序列算法调试性能瓶颈

我正在使用向量库和状态 monad 在 Haskell 中编写最长公共(public)子序列算法(以封装 Miller O(NP) algorithm 的非常命令式和可变性)。我已经用 C 为我需要它的某个项目编写了它，现在我正在用 Haskell 编写它作为一种探索如何编写那些与 C 匹配的具有良好性能的命令式网格行走算法的方法。我用未装箱向量编写的版本对于相同的输入，它比 C 版本慢大约 4 倍(并且使用正确的优化标志编译 - 我同时使用系统时钟时间和 Criterion 方法来验证 Haskell 和 C 版本之间的相对时间测量，以及相同的数据类型，都很大和微小的输入)。我一直在试图找出性能问题可能出在哪里，并希望得到反馈——我可能在这里遇到了一些众所周知的性能问题，尤其是在我在这里大量使用的向量库中。

在我的代码中，我有一个叫做 gridWalk 的函数，它被调用得最频繁，而且还完成了大部分工作。性能下降很可能存在，但我无法弄清楚它可能是什么。完整的 Haskell 代码是 here .以下代码片段:

import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when) 
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int


type MVI1 s  = MVector (PrimState (ST s)) Int

cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
               where
                 n = U.length a
                 m = U.length b
                 go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
                                    | otherwise = len

-- function to find previous y on diagonal k for furthest point 
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
              let k0 = k+offset-1
                  k1 = k+offset+1
              y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
              y1 <- MU.unsafeRead fp k1
              if y0 > y1 then return (k0,y0)
              else return (k1,y1)
{-#INLINE findYP #-}

gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
   let !offset = 1+U.length a
   (!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset                          
   let xp = yp-k
       len = {-#SCC cmp #-} cmp a b xp yp
       x = xp+len
       y = yp+len

   {-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y  
   return ()
{-#INLINE gridWalk #-}

-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s ->  Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}

我添加了一些成本中心注释，并使用某个 LCS 输入运行分析以进行测试。这是我得到的:

  total time  =        2.39 secs   (2394 ticks @ 1000 us, 1 processor)
  total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc

gridWalk    Main       67.5   52.7
findSnakes  Main       23.2   27.8
cmp         Main        4.2    0.0
findYP      Main        3.5   19.4
updateFP    Main        1.6    0.0


                                                         individual     inherited
COST CENTRE    MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN           MAIN                     64           0    0.0    0.0   100.0  100.0
 main          Main                    129           0    0.0    0.0     0.0    0.0
 CAF           Main                    127           0    0.0    0.0   100.0  100.0
  findSnakes   Main                    141           0    0.0    0.0     0.0    0.0
  main         Main                    128           1    0.0    0.0   100.0  100.0
   findSnakes  Main                    138           0    0.0    0.0     0.0    0.0
    gridWalk   Main                    139           0    0.0    0.0     0.0    0.0
     cmp       Main                    140           0    0.0    0.0     0.0    0.0
   while       Main                    132        4001    0.1    0.0   100.0  100.0
    findSnakes Main                    133       12000   23.2   27.8    99.9   99.9
     gridWalk  Main                    134    16004000   67.5   52.7    76.7   72.2
      cmp      Main                    137    16004000    4.2    0.0     4.2    0.0
      updateFP Main                    136    16004000    1.6    0.0     1.6    0.0
      findYP   Main                    135    16004000    3.5   19.4     3.5   19.4
   newVI1      Main                    130           1    0.0    0.0     0.0    0.0
   newVI1.\   Main                    131        8004    0.0    0.0     0.0    0.0
 CAF           GHC.Conc.Signal         112           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding         104           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding.Iconv   102           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Handle.FD         95           0    0.0    0.0     0.0    0.0

如果我正确解释了分析输出(并假设分析没有太多失真)，gridWalk占用大部分时间，但主要功能cmp和 findYP在 gridWalk 中完成繁重的工作，似乎在分析报告中花费的时间很少。所以，也许瓶颈在 forM_ wrapper findSnakes函数用于调用 gridWalk ?堆配置文件看起来也很干净: Heap profile

阅读核心，没有什么真正跳出来。我认为内部循环中的某些值可能会被装箱，但我没有在核心中发现它们。我希望性能问题是由于我错过了一些简单的事情。

更新

根据@DanielFischer 的建议，我替换了 forM_的 Data.Vector.Unboxed与 Control.Monad在 findSnakes将性能从 C 版本的 4 倍提高到 2.5 倍的函数。 Haskell 和 C 版本现已发布 here如果你想尝试一下。

我仍在挖掘核心，看看瓶颈在哪里。 gridWalk是最常调用的函数，并且要使其性能良好，lcsh应该减少 whileM_循环到一个很好的条件检查和内联迭代内部循环 findSnakes代码。我怀疑在组装中，whileM_ 的情况并非如此。循环，但由于我对翻译核心和在汇编中定位名称困惑的 GHC 函数不是很了解，我想这只是耐心地解决问题直到我弄清楚的问题。同时，如果有关于性能修复的任何指示，我们将不胜感激。

我能想到的另一种可能性是函数调用期间堆检查的开销。正如在分析报告中看到的，gridWalk被称为 16004000 次。假设堆检查有 6 个周期(我猜它更少，但仍然让我们假设)，在 3.33GHz 的盒子上，96024000 个周期大约是 0.02 秒。

此外，一些性能数据:
Haskell code (GHC 7.6.1 x86_64) : 在 forM_ 之前是 ~0.25 秒使固定。

 time ./T
1

real    0m0.150s
user    0m0.145s
sys     0m0.003s

C code (gcc 4.7.2 x86_64) :

time ./test
1

real    0m0.065s
user    0m0.063s
sys     0m0.000s

更新 2:

更新代码为 here .使用 STUArray也不会改变数字。在 Mac OS X (x86_64,ghc7.6.1) 上的性能大约是 1.5 倍，与@DanielFischer 在 Linux 上报道的非常相似。

haskell 代码:

$ time ./Diff
1

real    0m0.087s
user    0m0.084s
sys 0m0.003s

代码:

$ time ./test
1

real    0m0.056s
user    0m0.053s
sys 0m0.002s

扫一眼cmm ，调用是尾递归的，并被 llvm 转成循环.但是每个新的迭代似乎也分配了调用堆检查的新值，因此，可能解释了性能差异。我必须考虑如何以这样一种方式编写尾递归，以便在迭代之间不分配任何值，从而避免堆检查和分配开销。

最佳答案

你受到了巨大的打击

U.forM_ (U.fromList [0..ct-1])

在 findSnakes .我确信这不应该发生(票？)，但这分配了一个新的 Vector每次都要遍历findSnakes叫做。如果你使用

Control.Monad.forM_ [0 .. ct-1]

相反，运行时间大约减半，分配在这里下降了大约 500 倍。 (GHC优化C.M.forM_ [0 :: Int .. limit]好了，列表被淘汰了，剩下的基本就是一个循环了。)你自己写循环可以稍微好一点。

一些导致无故分配/代码大小膨胀而不会对性能造成太大影响的事情是

未使用的 Bool lcsh 的论据

cmp论据 findSnakes和 gridWalk ;如果这些从未使用与顶级 cmp 不同的比较来调用，该参数导致不必要的代码重复。

通用型while ;专门用于使用的类型 ST s Bool -> ST s () -> ST s ()减少分配(很多)，也减少运行时间(稍微，但很明显，在这里)。

关于性能分析的通用词:编译用于性能分析的程序会抑制许多优化。特别是对于像 vector 这样的库, bytestring或 text大量使用融合，分析往往会产生误导性的结果。

例如，您的原始代码在此处生成

    total time  =        3.42 secs   (3415 ticks @ 1000 us, 1 processor)
    total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.7   52.7   2176 2432608000
findSnakes  Main       20.0   27.8    682 1281440080
cmp         Main        9.2    0.0    313        16
findYP      Main        4.2   19.4    144 896224000
updateFP    Main        2.7    0.0     91         0

只需在 len 的绑定(bind)上添加一声爆炸在 gridWalk在非分析版本中没有任何改变，但对于分析版本

    total time  =        2.98 secs   (2985 ticks @ 1000 us, 1 processor)
    total alloc = 3,204,404,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.0   32.0   1881 1024256000
findSnakes  Main       22.2   40.0    663 1281440080
cmp         Main        7.2    0.0    214        16
findYP      Main        4.7   28.0    140 896224000
updateFP    Main        2.7    0.0     82         0

它有很大的不同。对于包括上述更改的版本(以及 len 中 gridWalk 的爆炸声)，分析版本说

total alloc = 1,923,412,776 bytes  (excludes profiling overheads)

但非分析版本

     1,814,424 bytes allocated in the heap
        10,808 bytes copied during GC
        49,064 bytes maximum residency (2 sample(s))
        25,912 bytes maximum slop
             1 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0         2 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.12s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.12s  (  0.12s elapsed)

说它分配的比分析版本少 1000 倍。

对于 vector和 friend 的代码，比分析更可靠地识别瓶颈(不幸的是，它也更耗时和困难得多)正在研究生成的核心(或程序集，如果您精通阅读)。

关于更新，我的机器上的 C 运行速度稍慢(gcc-4.7.2，-O3)

$ time ./miltest1

real    0m0.074s
user    0m0.073s
sys     0m0.001s

但是 Haskell 差不多

$ time ./hsmiller
1

real    0m0.151s
user    0m0.149s
sys     0m0.001s

通过 LLVM 后端编译时速度会快一些:

$ time ./hsmiller1

real    0m0.131s
user    0m0.129s
sys     0m0.001s

当我们更换 forM_带有手动循环，

findSnakes a b fp !k !ct op = go 0
  where
    go x
        | x < ct    = gridWalk a b fp (op k x) >> go (x+1)
        | otherwise = return ()

它变得有点快，

$ time ./hsmiller
1

real    0m0.124s
user    0m0.121s
sys     0m0.002s

分别通过 LLVM:

$ time ./hsmiller
1

real    0m0.108s
user    0m0.107s
sys     0m0.000s

总的来说，生成的核心看起来不错，一个小烦恼是

Main.$wa
  :: forall s.
     GHC.Prim.Int#
     -> GHC.Types.Int
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s, Main.MVI1 s #)

和一个稍微迂回的实现。这是通过制作 newVI1 来修复的在第二个参数中严格，

newVI1 n !x = do

由于不经常调用它，因此对性能的影响当然可以忽略不计。

肉是核心lcsh ，而且看起来还不错。唯一的盒装东西是 Int s 读取/写入 STRef ，这是不可避免的。不那么令人愉快的是，核心包含大量代码重复，但根据我的经验，这很少是真正的性能问题，而且并非所有重复的代码都能在代码生成过程中幸存下来。

and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code.

当您添加 INLINE 时，您会得到一个内循环。 pragma to whileM_ ，但该循环并不好，在这种情况下，它比使用 whileM_ 慢得多。出线(我不确定这是否仅仅是由于代码大小，但可能是)。

关于performance - 最长公共(public)子序列算法调试性能瓶颈，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/16952956/

performance - 最长公共(public)子序列算法调试性能瓶颈

上一篇：class-diagram - ER图和类图的区别

下一篇：apache2 - 带有通配符 VirtualDocumentRoot 的 VirtualHost