haskell - Haskell 的类型系统可以强制数据管道阶段的正确排序吗?

标签 haskell pipeline type-systems type-level-computation

我使用质谱数据创建了许多数据处理管道,其中来自仪器的数据被清理、转换、缩放、检查并最终分析。我倾向于为此使用递归类型定义——这是一个非常简化的示例:

data Dataset = Initial { x::(Vector Double), y::(Vector Double) name::String}
             | Cleaned { x::(Vector Double), y::(Vector Double) name::String}
             | Transformed { x::(Vector Double), y::(Vector Double) name::String}
那么一个典型的管道将只是一个以 Dataset 开头的函数链。创建者,然后继续使用类型为 Dataset 的函数, 并产生 Dataset 类型的东西:
createDataset :: Vector Double -> Vector Double -> String -> Dataset
createDataset x y name = Initial x y name

removeOutliers :: Dataset -> Dataset
removeOutliers (Initial x y n) = let
                         (new_x, new_y) = outlierRemovalFunction x y
                         in Cleaned new_x new_y (n ++"_outliersRemoved")
               (Cleaned x y n) = error "Already been cleaned"
               (Scaled x y n) = error "Scaled data should have already been cleaned"
               (Transformed x y n) = error "Transformed data should have already been cleaned"

logTransform :: Dataset -> Dataset
logTransform (Initial x y n) = error "Need to clean first"
             (Cleaned x y n) = let
                         (new_x, new_y) = logTransformFunction x y
                         in Transformed new_x new_y (n ++ "_logTransformed)


因此,这可以确保管道中的处理步骤以正确的顺序发生,并且您可以使用组合创建整个管道
(logTransform . removeOutliers . createDataset) init_y init_y "ourData"
但由于几个原因,这种方法似乎极其有限。第一个原因是通过构造函数上的模式匹配来检测不正确性,因此对管道的添加和更改将需要在模式匹配中的任何地方进行更改。想象一个更复杂的例子,有几个清理和几个转换步骤——基本上每个可能的组合都需要它自己独特的构造函数,所有的模式匹配都必须是非穷举的,或者在任何地方都绝对重复。
这似乎有限的第二个原因是,构造不正确的管道只能在运行时被故障检测到。我已经对所有处理步骤进行了排序,因此在管道中的每一点我都确切地知道数据发生了什么。类型系统应该能够防止我首先错误地将步骤组合在一起,并且在编译时应该可以检测到在未清理的输入上使用期望清理数据的函数。
我考虑过为管道中的每个阶段设置单独的类型,然后将“数据集”接口(interface)实现为类型类,例如:
class Dataset a where
    x :: a -> Vector Double
    y :: a -> Vector Double
    name :: a -> String

data Initial = Initial x y name
instance Dataset Initial where ...

data Cleaned a = Cleaned a
instance Dataset Cleaned where ...

data Transformed a = Transformed a
instance Dataset Transformed where ...
那么你可以做一些事情(我认为......),比如:

removeOutliers :: (Dataset a) => a -> Cleaned a
removeOutliers = ...

logTransform :: (Dataset a) => Cleaned a -> Transformed Cleaned a
logTransform = ...
我相信这种方法解决了上面的问题 1:我们现在可以在编译时检测管道的不正确性,并且我们不再需要使用所有不同的构造函数来描述处理步骤。
但是,似乎我只是将问题“上移了一级”。我现在正在处理类型变量和所有这些嵌套类型。而不是需要 Dataset对于每个可能的管道步骤组合的构造函数,我现在需要创建一个 Dataset每个类型组合的实例!
我真正想要的是一种方法,让处理管道中的类型在其约束中既非常具体又非常笼统。我想使用类型/约束来详细说明应用特定处理步骤的顺序,但我也希望类型/约束能够传达更一般的东西——即“除了其他不重要的步骤,已完成异常值删除”。所以基本上已经删除了异常值的类型。
传达订购信息将是一个 super 奖励——“除了其他不重要的步骤之外,还发生了异常值删除,并且在稍后的某个时间点发生了对数转换”。在对数转换之前已删除异常值的事物类型(不一定是在之前)。
使用 Haskell 的类型系统可以做这种事情吗?

最佳答案

是的,现代的 Haskell 类型系统可以处理这个问题。但是,与通常的术语级编程相比,Haskell 中的类型级编程仍然很难。语法和技术很复杂,并且文档有些缺乏。还有一种情况是,对需求的相对较小的更改可能会导致实现中的大变化(即,向您的实现添加新的“功能”可能会级联成所有类型的重大重组),这可能会使其变得困难如果您仍然不确定您的实际要求是什么,请提出解决方案。
@JonPurdy 的评论和 @AtnNn 的回答给出了一些可能的想法。这是一个尝试满足您的特定要求的解决方案。但是,除非您愿意坐下来自学一些类型级编程,否则它可能很难使用(或至少难以适应您的要求)。
无论如何,假设您有兴趣使用已在其上执行的进程的类型级别列表来标记固定数据结构(即,始终具有相同类型的相同字段),并使用一种方法来检查进程列表所需进程的有序子列表。
我们需要一些扩展:

{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}
进程标签本身被定义为 sum 类型的构造函数,DataKinds扩展将标签从术语级别提升到类型级别:
data Process = Cleaned | Transformed | Scaled | Inspected | Analyzed
然后用应用进程列表标记数据结构,即它的“管道”:
data Dataset (pipeline :: [Process])
  = Dataset { x :: [Double]
            , y :: [Double]
            , name :: String }
注意:管道以相反的顺序最方便,最近应用的 Process第一的。
为了允许我们要求 pipeline有一个特定的有序进程子序列,我们需要一个类型级函数(即类型族)来检查子序列。这是一个版本:
type family a || b where
  True  || b = True
  False || b = b

type family Subseq xs ys where
  Subseq '[]      ys  = True
  Subseq nonempty '[] = False
  Subseq (x:xs) (x:ys) = Subseq xs ys || Subseq (x:xs) ys
  Subseq xs     (y:ys) = Subseq xs ys
我们可以在 GHCi 中测试这个类型级别的函数:
λ> :kind! Subseq '[Inspected, Transformed] '[Analyzed, Inspected, Transformed, Cleaned]
Subseq '[Inspected, Transformed] '[Analyzed, Inspected, Transformed, Cleaned] :: Bool
= 'True
λ> :kind! Subseq '[Inspected, Transformed] '[Analyzed, Transformed, Cleaned]
Subseq '[Inspected, Transformed] '[Analyzed, Transformed, Cleaned] :: Bool
= 'False
λ> :kind! Subseq '[Inspected, Transformed] '[Transformed, Inspected]
Subseq '[Inspected, Transformed] '[Transformed, Inspected] :: Bool
= 'False
如果您要编写一个函数,该函数需要对数据集进行转换然后清除异常值(按此顺序),可能与函数本身应用缩放步骤的其他不重要步骤混合,则签名将如下所示:
-- remember: pipeline type is in reverse order
foo1 :: (Subseq [Cleaned, Transformed] pipeline ~ True)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo1 = undefined
如果要防止双重缩放,可以引入另一个类型级函数:
type family Member x xs where
  Member x '[] = 'False
  Member x (x:xs) = 'True
  Member x (y:xs) = Member x xs
并添加另一个约束:
foo2 :: ( Subseq [Cleaned, Transformed] pipeline ~ True
        , Member Scaled pipeline ~ False)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo2 = undefined
然后:
> foo2 (Dataset [] [] "x" :: Dataset '[Transformed])
... Couldn't match type ‘'False’ with ‘'True’ ...
> foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Scaled, Transformed])
... Couldn't match type ‘'False’ with ‘'True’ ...
> foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
-- typechecks okay
foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
  :: Dataset '[ 'Scaled, 'Cleaned, 'Transformed]
在约束语法和错误消息方面,您可以使用一些额外的类型别名和类型系列使其更加友好:
import Data.Kind
import GHC.TypeLits

type Require procs pipeline = Require1 (Subseq procs pipeline) procs pipeline
type family Require1 b procs pipeline :: Constraint where
  Require1 True procs pipeline = ()
  Require1 False procs pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " lacks required processing " :<>: ShowType procs)
type Forbid proc pipeline = Forbid1 (Member proc pipeline) proc pipeline
type family Forbid1 b proc pipeline :: Constraint where
  Forbid1 False proc pipeline = ()
  Forbid1 True proc pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " must not include " :<>: ShowType proc)

foo3 :: (Require [Cleaned, Transformed] pipeline, Forbid Scaled pipeline)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo3 = undefined
这使:
> foo3 (Dataset [] [] "x" :: Dataset '[Transformed])
...The pipeline '[ 'Transformed] lacks required processing '[ 'Cleaned, 'Transformed]...
> foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Scaled, Transformed])
...The pipeline '[ 'Cleaned, 'Scaled, 'Transformed] must not include 'Scaled...
> foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
-- typechecks okay
foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
  :: Dataset '[ 'Scaled, 'Cleaned, 'Transformed]
完整的代码示例:
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}

import Data.Kind
import GHC.TypeLits

data Process = Cleaned | Transformed | Scaled | Inspected | Analyzed

data Dataset (pipeline :: [Process])
  = Dataset { x :: [Double]
            , y :: [Double]
            , name :: String }

type family a || b where
  True  || b = True
  False || b = b

type family Subseq xs ys where
  Subseq '[]      ys  = True
  Subseq nonempty '[] = False
  Subseq (x:xs) (x:ys) = Subseq xs ys || Subseq (x:xs) ys
  Subseq xs     (y:ys) = Subseq xs ys

type family Member x xs where
  Member x '[] = False
  Member x (x:xs) = True
  Member x (y:xs) = Member x xs

type Require procs pipeline = Require1 (Subseq procs pipeline) procs pipeline
type family Require1 b procs pipeline :: Constraint where
  Require1 True procs pipeline = ()
  Require1 False procs pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " lacks required processing " :<>: ShowType procs)
type Forbid proc pipeline = Forbid1 (Member proc pipeline) proc pipeline
type family Forbid1 b proc pipeline :: Constraint where
  Forbid1 False proc pipeline = ()
  Forbid1 True proc pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " must not include " :<>: ShowType proc)


foo1 :: (Subseq [Cleaned, Transformed] pipeline ~ True)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo1 = undefined

foo2 :: ( Subseq [Cleaned, Transformed] pipeline ~ True
        , Member Scaled pipeline ~ False)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo2 = undefined

foo3 :: (Require [Cleaned, Transformed] pipeline, Forbid Scaled pipeline)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo3 = undefined

关于haskell - Haskell 的类型系统可以强制数据管道阶段的正确排序吗?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/63674412/

相关文章:

types - Erlang 的打字机推导出奇怪的字符串类型

windows - 将帧从视频流式传输到管道并通过 ffmpeg 将其发布到 HTTP mjpeg

powershell - 选择对象 - 首先影响管道中的先前 cmdlet

linear-algebra - 单位如何流经矩阵运算?

haskell - "non-Exhaustive patterns in function listelem" haskell

jenkins - Jenkins 在 devops 管道中的位置如何?

javascript - 结构类型不是鸭子类型(duck typing)

haskell - 如何停止 GHCi 中的无限评估?

Haskell 分析

haskell - Haskell 中的幻像类型