performance - 为什么通过分而治之对 Data.Sequence 求和更快,即使没有并行性?

标签 performance haskell

我在 Data.Sequence.Seq 上玩并行缩减,并且我注意到即使没有并行性,分治法也具有速度优势。有谁知道为什么?

这是我的代码:

import qualified Data.Sequence as S
import qualified Data.Foldable as F

import System.Random
import Control.DeepSeq
import Criterion.Main
import Test.QuickCheck
import Control.Exception ( evaluate )

instance (Arbitrary a) => Arbitrary (S.Seq a) where
    arbitrary = fmap S.fromList arbitrary

instance NFData a => NFData (S.Seq a) where
    rnf = F.foldr seq ()

funs :: [(String, S.Seq Int -> Int)]
funs =
    [ ("seqDirect"   , seqDirect)
    , ("seqFoldr"    , seqFoldr)
    , ("seqFoldl'"   , seqFoldl')
    , ("seqSplit  1" , (seqSplit  1))
    , ("seqSplit  2" , (seqSplit  2))
    , ("seqSplit  4" , (seqSplit  4))
    , ("seqSplit  8" , (seqSplit  8))
    , ("seqSplit 16" , (seqSplit 16))
    , ("seqSplit 32" , (seqSplit 32)) ]

main :: IO ()
main = do
    mapM_ (\(_,f) -> quickCheck (\xs -> seqDirect xs == f xs)) funs
    gen <- newStdGen
    let inpt = S.fromList . take 100000 $ randoms gen
    evaluate (rnf inpt)
    defaultMain [ bench n (nf f inpt) | (n,f) <- funs ]

seqDirect :: S.Seq Int -> Int
seqDirect v = case S.viewl v of
    S.EmptyL -> 0
    x S.:< xs -> x + seqDirect xs

seqFoldr :: S.Seq Int -> Int
seqFoldr = F.foldr (+) 0

seqFoldl' :: S.Seq Int -> Int
seqFoldl' = F.foldl' (+) 0

seqSplit :: Int -> S.Seq Int -> Int
seqSplit 1 xs = seqFoldr xs
seqSplit _ xs | S.null xs = 0
seqSplit n xs =
    let (a, b) = S.splitAt (S.length xs `div` n) xs
        sa = seqFoldr a
        sb = seqSplit (n-1) b
    in  sa + sb

结果:

$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.0.4

$ ghc --make -O2 -fforce-recomp -rtsopts seq.hs
[1 of 1] Compiling Main             ( seq.hs, seq.o )
Linking seq ...

$ ./seq +RTS -s
./seq +RTS -s
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
warming up
estimating clock resolution...
mean is 5.882556 us (160001 iterations)
found 2368 outliers among 159999 samples (1.5%)
  2185 (1.4%) high severe
estimating cost of a clock call...
mean is 85.26448 ns (44 iterations)
found 4 outliers among 44 samples (9.1%)
  3 (6.8%) high mild
  1 (2.3%) high severe

benchmarking seqDirect
mean: 23.37511 ms, lb 23.01101 ms, ub 23.77594 ms, ci 0.950
std dev: 1.953348 ms, lb 1.781578 ms, ub 2.100916 ms, ci 0.950

benchmarking seqFoldr
mean: 25.60206 ms, lb 25.39648 ms, ub 25.80034 ms, ci 0.950
std dev: 1.030794 ms, lb 926.7246 us, ub 1.156656 ms, ci 0.950

benchmarking seqFoldl'
mean: 10.65757 ms, lb 10.29087 ms, ub 10.99869 ms, ci 0.950
std dev: 1.819595 ms, lb 1.703732 ms, ub 1.922018 ms, ci 0.950

benchmarking seqSplit  1
mean: 25.50376 ms, lb 25.29045 ms, ub 25.71225 ms, ci 0.950
std dev: 1.075497 ms, lb 961.5707 us, ub 1.229739 ms, ci 0.950

benchmarking seqSplit  2
mean: 18.15032 ms, lb 17.62943 ms, ub 18.66413 ms, ci 0.950
std dev: 2.652232 ms, lb 2.288088 ms, ub 3.044585 ms, ci 0.950

benchmarking seqSplit  4
mean: 10.48334 ms, lb 10.14152 ms, ub 10.87061 ms, ci 0.950
std dev: 1.869274 ms, lb 1.690063 ms, ub 1.997915 ms, ci 0.950

benchmarking seqSplit  8
mean: 5.737956 ms, lb 5.616747 ms, ub 5.965689 ms, ci 0.950
std dev: 825.2361 us, lb 442.1652 us, ub 1.232003 ms, ci 0.950

benchmarking seqSplit 16
mean: 3.677038 ms, lb 3.669035 ms, ub 3.685547 ms, ci 0.950
std dev: 42.18741 us, lb 36.57112 us, ub 49.93574 us, ci 0.950

benchmarking seqSplit 32
mean: 2.855626 ms, lb 2.849962 ms, ub 2.862226 ms, ci 0.950
std dev: 31.25475 us, lb 26.49104 us, ub 37.18611 us, ci 0.950
  25,154,069,064 bytes allocated in the heap
   4,120,506,464 bytes copied during GC
      32,344,120 bytes maximum residency (446 sample(s))
       4,042,704 bytes maximum slop
              78 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 42092 collections,     0 parallel,  6.57s,  6.57s elapsed
  Generation 1:   446 collections,     0 parallel,  2.62s,  2.62s elapsed

  INIT  time    0.00s  (  0.00s elapsed)

  MUT   time   18.57s  ( 18.58s elapsed)
  GC    time    9.19s  (  9.19s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   27.76s  ( 27.77s elapsed)

  %GC time      33.1%  (33.1% elapsed)

  Alloc rate    1,354,367,579 bytes per MUT second

  Productivity  66.9% of total user, 66.9% of total elapsed

最佳答案

注:这个答案实际上并没有回答这个问题。它只是以不同的方式重申了这个问题。 Data.Sequence.foldr 的确切原因随着序列变大而减慢仍然未知。

你的代码

seqFoldr :: S.Seq Int -> Int
seqFoldr = F.foldr (+) 0

根据序列的长度具有非线性性能。看看这个基准:
./seq-customized +RTS -s -A128M

[Length] [Performance of function seqFoldr]
25000:   mean: 1.096352 ms, lb 1.083301 ms, ub 1.121152 ms, ci 0.950
50000:   mean: 2.542133 ms, lb 2.514076 ms, ub 2.583209 ms, ci 0.950
100000:  mean: 6.068437 ms, lb 5.951889 ms, ub 6.237442 ms, ci 0.950
200000:  mean: 14.41332 ms, lb 13.95552 ms, ub 15.21217 ms, ci 0.950

使用以 25000 为基础的线为我们提供了下表:
[Length] [Performance of function seqFoldr]
1x:      mean: 1.00 = 1*1.00
2x:      mean: 2.32 = 2*1.16
4x:      mean: 5.54 = 4*1.39
8x:      mean: 13.15 = 8*1.64

在上表中,非线性由系列 1.00、1.16、1.39、1.64 证明。

另见 http://haskell.org/haskellwiki/Performance#Data.Sequence_vs._lists

假设 Seq 的初始长度 xs是 100000 和 n是 32,你的代码
seqSplit n xs =
let (a, b) = S.splitAt (S.length xs `div` n) xs
    sa = seqFoldr a
    sb = seqSplit (n-1) b
in  sa + sb

将一些较短的 Seqs 传递给函数 seqFoldr .从上述代码传递给函数 seqFoldr 的序列的连续长度看起来像:
(length xs)/n = (length a)
--------------------------
100000/32 = 3125
(100000-3125)/31 = 3125
(100000-2*3125)/30 = 3125
...
(100000-30*3125)/2 = 3125

根据我回答的第一部分(我们看到性能是非线性的),[32 次调用 seqFoldr Seq 长度为 3125] 的执行速度将比 [1 call to seqFoldr 快具有长度为 32*3125=100000] 的单个 Seq。

因此,您的问题的答案是:因为foldrData.Sequence随着序列变大,速度会变慢。

关于performance - 为什么通过分而治之对 Data.Sequence 求和更快,即使没有并行性?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/7869359/

相关文章:

haskell - 如何在 Haskell 中使用没有隐含前奏的 fromInteger?

php - json 列与多列

c++ - 强制 GCC 执行 memcpy 运行时大小检查的循环取消切换?

linux - 如何在Linux中的Elasticsearch中解决此错误

haskell - Haskell 中的伪随机数生成器

haskell - 何时/为何使用 MVar 而不是 TVar

haskell - 为什么容器类型没有类型类?

.net - String.Join 与 StringBuilder : which is faster?

performance - 如何使rust代码并行运行更快?

haskell - 我的(尝试的)iterateM 实现有什么问题?