performance - Haskell 性能——拓扑排序不够快

我是 Haskell 初学者，选择它来解决我类(class)的编程任务，但是我的解决方案太慢了，没有被接受。我正在尝试对其进行概要分析，并希望我能从这里的更高级的 Haskellers 那里得到一些指示。

到目前为止，我类唯一被接受的解决方案是用 Rust 编写的。我确信我应该能够在 Haskell 中实现类似的性能，并且我编写了可怕的命令式代码以希望提高性能，可惜无济于事。

我的第一个怀疑与work 相关，我在其中使用forever 遍历入度数组，直到出现越界异常。我希望这是尾递归的，并编译成 while (true) 样式循环。

我的第二个怀疑是 I/O 可能会减慢速度。

编辑:问题可能与我的算法有关，因为我没有保留入度为 0 的节点队列。谢谢@luqui。

EDIT2:看来真正的瓶颈是 I/O，感谢@Davislor，我解决了这个问题。

任务基于此:http://www.spoj.com/UKCPLAD/problems/TOPOSORT/我只能使用 Haskell 平台中的库。

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE LambdaCase #-}
{-# OPTIONS_GHC -O3 #-}

import Control.Monad
import Data.Array.IO
import Data.IORef
import Data.Int
import Control.Exception

type List = []
type Node = Int32
type Edge = (Node, Node)
type Indegree = Int32

main = do
  (numNodes, _) <- readPair <$> getLine
  edges <- map readPair . lines <$> getContents
  topo numNodes edges

-- lower bound
{-# INLINE lb #-}
lb = 1

topo :: Node -> List Edge -> IO ()
topo numNodes edges = do
    result <- newIORef []
    count <- newIORef 0
    indegrees <- newArray (lb,numNodes) 0 :: IO (IOUArray Node Indegree)
    neighbours <- newArray (lb,numNodes) [] :: IO (IOArray Node (List Node))
    forM_ edges $ \(from,to) -> do
      update indegrees to (+1)
      update neighbours from (to:)
    let work = forever $ do
          z <- getNext indegrees
          modifyIORef' result (z:)
          modifyIORef' count (+1)
          ns <- readArray neighbours z
          forM_ ns $ \n -> update indegrees n pred
    work `catch`
      \(_ :: SomeException) -> do
        count <- readIORef count
        if numNodes == count
          then (mapM_ (\n -> putStr (show n ++ " ")) . reverse) =<< readIORef result
          else putStrLn "Sandro fails."


{-# INLINE update #-}
update a i f = do
  x <- readArray a i
  writeArray a i (f x)

{-# INLINE getNext #-}
getNext indegrees = getNext' indegrees =<< getBounds indegrees

{-# INLINE getNext' #-}
getNext' indegrees (lb,ub) = readArray indegrees lb >>= \case
    0 -> writeArray indegrees lb (-1) >> return lb
    _ -> getNext' indegrees (lb+1,ub)

readPair :: String -> (Node,Node)
{-# INLINE readPair #-}
readPair = toPair . map read . words
  where toPair [x,y] = (x,y)
        toPair _ = error "Only two entries per line allowed"

示例输出

$ ./topo
8 9
1 4
1 2
4 2
4 3
3 2
5 2
3 5
8 2
8 6
^D
1 4 3 5 7 8 2 6

最佳答案

如果你还没有，profile your program通过使用 -prof -fprof-auto 进行编译，然后使用命令行选项 +RTS -p 执行。这将生成一个配置文件 *.prof，它会告诉您程序将所有时间花在了哪些函数上。但是，我可以立即看到最浪费时间的地方。您的直觉是正确的:是 I/O。

做了很多之后，我可以向您保证，您会发现它的绝大部分时间都花在了 I/O 上。要加快程序速度，您应该始终做的第一件事就是重写它以使用快速 I/O。 Haskell 是一种快速的语言，当您使用正确的数据结构时。 Prelude 中的默认 I/O 库使用单链表和延迟计算的 thunk，其中每个节点包含一个 Unicode 字符。这在 C 中也会很慢!

当输入为 ASCII 时，我使用 Data.ByteString.Lazy.Char8 获得了最佳结果，而使用 Data.ByteString.Builder 生成输出。 (另一种方法是 Data.Text。)这会在输入时为您提供一个延迟计算的严格字符缓冲区列表(因此交互式输入和输出仍然有效)，并在输出时填充单个缓冲区。

在编写了具有快速 I/O 的程序框架后，下一步就是查看算法，尤其是数据结构。使用分析来查看所有时间都去了哪里。但我建议您使用函数式算法，而不是尝试使用 do 在 Haskell 中编写命令式程序。

在 Haskell 中，我几乎总是使用更函数式的风格来处理此类问题:特别是，我的 main 函数几乎总是类似于:

import qualified Data.ByteString.Lazy.Char8 as B8

main :: IO()
main = B8.interact ( output . compute . input )

这使得除了调用 interact 之外的所有内容都是纯函数，并且隔离了解析代码和格式化代码，因此中间的 compute 部分可以独立于此.

因为这是一个作业，你想自己解决问题，我不会为你重构程序，但这是我在回答另一个论坛上的一个问题时写的一个例子来执行计数排序。它应该适合作为其他类型问题的骨架。

import Data.Array.IArray (accumArray, assocs)
import Data.Array.Unboxed (UArray)
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))

main :: IO()
main = B8.interact ( output . compute . input ) where
  input :: B8.ByteString -> [Int]
  input = map perLine . tail . B8.lines where
    perLine = decode . B8.readInt

    decode (Just (x, _)) = x
    decode Nothing = error "Invalid input: expected integer."

  compute :: [Int] -> [Int]
  compute = concatMap expand . assocs . countingSort . map encode where
    encode i = (i, 1)

    countingSort :: [(Int, Int)] -> UArray Int Int
    countingSort = accumArray (+) 0 (lower, upper)

    lower = 0
    upper = 1000000

    expand (i,c) = replicate c i

  output :: [Int] -> B8.ByteString
  output = toLazyByteString . foldMap perCase where
    perCase :: Int -> Builder
    perCase x = intDec x <> char7 '\n'

目前，这个版本的运行时间不到其他人的一半Haskell solution for the same problem ，这同样适用于我用它解决的实际比赛问题，并且the approach generalizes .

因此，我建议先将 I/O 更改为与此类似，然后进行性能分析，如果没有产生足够的差异，则返回性能分析输出。这也可能是一个很好的代码审查问题。

关于performance - Haskell 性能——拓扑排序不够快，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/49162717/

performance - Haskell 性能——拓扑排序不够快

上一篇：join - 如何有效地将大 rdd 加入 spark 中非常大的 rdd？

下一篇：xna - 通过将 2D Sprite 与 3D 模型混合来解决 Z 缓冲区问题