string - Haskell解决InterviewStreet字符串相似性挑战的解决方案

这是我解决“面试街”挑战的最佳尝试。

import Control.Monad
import Data.Text as T
import qualified Data.Text.IO as TIO


sumSimilarities s = (T.length s) + (sum $ Prelude.map (similarity s) (Prelude.tail $ tails s))

similarity :: Text -> Text -> Int
similarity a b = case commonPrefixes a b of
                     Just (x,_,_) -> T.length x
                     Nothing -> 0

main = do
    cases <- fmap read getLine
    inputs <- replicateM cases TIO.getLine
    forM_ inputs $ print . sumSimilarities

它只通过了7/10的测试用例。测试用例7、8和9失败，因为它们超过了分配的执行时间。
我一半在验证haskell中确实可以解决这个问题，一半在寻找一个优化的haskell程序是什么样子的。
谢谢！
泰勒

最佳答案

像user5402一样，我想知道一个等价的（对于等价的某些值）c程序是在时间限制内完成还是超时。如果可以的话，我们很有兴趣看看使用ByteStrings的等效程序是否能及时完成。-并不是说ByteStrings本身比Text快，但是由于输入必须转换为Text的内部表示，而ByteString则视其为原样，所以这可能会有区别。另一个可能的原因是，如果测试机器有32位GHCs，ByteStrings可能更快，因为text的融合至少需要比通常在32位体系结构上可用的寄存器更多的寄存器才能获得全部利润[很久以前，在text-0.5到text-0.7的时代，在我的32位机器上，bytestring曾经更快不知道这是否适用于较新的text版本。
好吧，既然user5402已经证实了c语言中的na i ve算法足够快，我就用ByteString编写了一个naive算法的实现

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Control.Monad
import Data.Word

main :: IO ()
main = do
    cl <- C.getLine
    case C.readInt cl of
      Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
      Nothing -> return ()

-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex

similarity :: B.ByteString -> Int
similarity bs
    | len == 0  = 0
    | otherwise = go len 1
      where
        !len = B.length bs
        go !acc i
            | i < len   = go (acc + prf 0 i) (i+1)
            | otherwise = acc
        prf !k j
            | j < len && bs ? k == bs ? j   = prf (k+1) (j+1)
            | otherwise = k

并与一些不良病例的OPText版本进行了比较在我的盒子里，这个速度是Text版本的四倍多，所以这个速度是否足够快是很有趣的（c版本是另外一个4.5倍的速度，所以很可能不是）。
但是，我认为，由于使用了具有二次最坏情况行为的天真算法，因此更有可能超过时间限制。可能有一些测试用例会引发天真算法的最坏情况。
因此，解决方案是使用一种规模更好、线性最优的算法。计算字符串相似度的一种线性算法是Z-algorithm。
这个想法很简单（但是，和大多数好主意一样，不容易有）。让我们调用一个（非空的）子字符串，它也是字符串的前缀一个前缀子字符串。为了避免重新计算，该算法使用前缀子字符串的窗口，该窗口从当前考虑的索引（最右侧延伸最远）之前开始（最初，该窗口为空）。
使用的变量和算法的不变量：
正在考虑的索引从1开始（用于索引，未考虑整个字符串），并将其增加到i
length - 1和left，前缀子字符串窗口的第一个和最后一个索引；不变量：
right，left < i，或left <= right < length(S)或left > 0，
如果right < 1，那么left > 0是S[left .. right]和S的最大共同前缀，
如果S[left .. ]和1 <= j < i是前缀，则S[j .. k]
数组S，不变量：对于k <= right，Z包含最长公共前缀1 <= k < i和Z[k]的长度。
算法：
设置S[k .. ]，S（允许有i = 1的任何值），并为所有指数设置left = right = 0。
如果left <= right < 1，则停止。
如果Z[j] = 0，找到1 <= j < length(S)和i == length(S)的最长公共前缀的长度i > right，将其存储在l中。如果我们发现一个窗口比前一个窗口向右延伸得更远，那么设置S和S[i .. ]，否则保持不变。增加Z[i]并转到2。
这里l > 0，所以子字符串left = i是已知的-因为right = i+l-1是i的前缀，所以它等于left < i <= right。
现在考虑最长的公共前缀S[i .. right]，子字符串从索引S[left .. right]开始。
它的长度是S，因此S[i-left .. right-left]表示S，并且
i - left。现在，如果Z[i-left]，则S[k] = S[i-left + k]位于已知窗口内，因此

S[i + Z[i-left]] = S[i-left + Z[i-left]] ≠ S[Z[i-left]]
S[i + k]         = S[i-left + k]         = S[k]   for 0 <= k < Z[i-left]

我们发现，0 <= k < Z[i-left]和S[Z[i-left]] ≠ S[i-left+Z[i-left]]的最长公共前缀的长度为Z[i-left] <= right-i。
然后设置i + Z[i-left]，递增S，并转到2。
否则，S[i .. ]是Z[i-left]的前缀，我们检查它的扩展程度，开始比较索引Z[i] = Z[i-left]和i处的字符。长度设为S[i .. right]。设置S，right+1，right+1 - i，递增l，然后转到2。
由于窗口从不向左移动，并且比较总是在窗口结束后开始，因此字符串中的每个字符最多成功地与字符串中的较早字符进行一次比较，并且对于每个开始索引，最多有一次不成功的比较，因此算法是线性的。
代码（出于习惯使用Z[i] = l，应该可以轻松地移植到left = i）：

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Control.Monad
import Data.Word

main :: IO ()
main = do
    cl <- C.getLine
    case C.readInt cl of
      Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
      Nothing -> return ()

-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex

-- Calculate the similarity of a string using the Z-algorithm
similarity :: B.ByteString -> Int
similarity bs
    | len == 0  = 0
    | otherwise = runST getSim
      where
        !len = B.length bs
        getSim = do
            za <- newArray (0,len-1) 0 :: ST s (STUArray s Int Int)
            -- The common prefix of the string with itself is entire string.
            unsafeWrite za 0 len
            let -- Find the length of the common prefix.
                go !k j
                    | j < len && (bs ? j == bs ? k) = go (k+1) (j+1)
                    | otherwise = return k
                -- The window with indices in [left .. right] is the prefix-substring
                -- starting before i that extends farthest.
                loop !left !right i
                    | i >= len  = count 0 0 -- when done, sum
                    | i > right = do
                        -- We're outside the window, simply
                        -- find the length of the common prefix
                        -- and store it in the Z-array.
                        w <- go 0 i
                        unsafeWrite za i w
                        if w > 0
                          -- We got a non-empty common prefix and a new window.
                          then loop i (i+w-1) (i+1)
                          -- No new window, same procedure at next index.
                          else loop left right (i+1)
                    | otherwise = do
                        -- We're inside the window, so the substring starting at
                        -- (i - left) has a common prefix with the substring
                        -- starting at i of length at least (right - i + 1)
                        -- (since the [left .. right] window is a prefix of bs).
                        -- But we already know how long the common prefix
                        -- starting at (i - left) is.
                        z <- unsafeRead za (i-left)
                        let !s = right-i+1 -- length of known prefix starting at i
                        if z < s
                          -- If the common prefix of the substring starting at
                          -- (i - left) is shorter than the rest of the window,
                          -- the common prefix of the substring starting at i
                          -- is the same. Store it and move on with the same window.
                          then do
                              unsafeWrite za i z
                              loop left right (i+1)
                          else do
                              -- Otherwise, find out how far the common prefix
                              -- extends, starting at (right + 1) == s + i.
                              w <- go s (s+i)
                              unsafeWrite za i w
                              loop i (i+w-1) (i+1)
                count !acc i
                    | i == len  = return acc
                    | otherwise = do
                        n <- unsafeRead za i
                        count (acc+n) (i+1)
            loop 0 0 1

关于string - Haskell解决InterviewStreet字符串相似性挑战的解决方案，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/12231495/

string - Haskell解决InterviewStreet字符串相似性挑战的解决方案

上一篇：algorithm - n 元组的平面排列(广度优先)

下一篇：algorithm - 最大矩形重叠点