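python - Saving part of a sklearn pipeline

Some features in my model can take a while to generate, so to experiment quickly with multiple features and parameters, it would be good to save them to disk for later use.

As a concrete example (taken from here), suppose I have the following pipeline: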

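from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# EssayExractor, LengthTransformer and MispellingCountTransformer are
# custom transformers defined elsewhere (not shown in the question).
pipeline = Pipeline([
  ('extract_essays', EssayExractor()),
  ('features', FeatureUnion([
    ('ngram_tf_idf', Pipeline([
      ('counts', CountVectorizer()),
      ('tf_idf', TfidfTransformer())
    ])),
    ('essay_length', LengthTransformer()),
    ('misspellings', MispellingCountTransformer())
  ])),
  ('classifier', MultinomialNB())
])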

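Suppose I want to change CountVectorizer() to CountVectorizer(max_features=1000). Then only CountVectorizer and the steps that consume its output (TfidfTransformer and MultinomialNB) should need to be recomputed, because only their parameters or inputs have changed. The other transformers could be reused as they are.

Can this be achieved somehow?

Best answer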
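For reference, the parameter change described in the question can be expressed without rebuilding the pipeline, using scikit-learn's nested parameter names (step names joined by double underscores). The sketch below uses a reduced version of the pipeline that leaves out the custom transformers so that it runs on its own; the name `reduced_pipeline` is made up for this illustration. It only shows how the parameter is changed, not how to avoid recomputing the other steps.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Reduced stand-in for the question's pipeline: the custom transformers
# are omitted so the example is self-contained.
reduced_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("ngram_tf_idf", Pipeline([
            ("counts", CountVectorizer()),
            ("tf_idf", TfidfTransformer()),
        ])),
    ])),
    ("classifier", MultinomialNB()),
])

# Nested parameters are addressed as <step>__<sub-step>__<parameter>.
reduced_pipeline.set_params(features__ngram_tf_idf__counts__max_features=1000)

# Read the value back to confirm the nested CountVectorizer was updated.
param_name = "features__ngram_tf_idf__counts__max_features"
max_features = reduced_pipeline.get_params()[param_name]
print(max_features)
```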

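I have had some success with Pachyderm. It has a somewhat git-like CLI that lets you store your workflow. In its repo, take a look at the "ML pipeline for Iris Classification" example. It gives some detail on how to create a pipeline and training data and save them into what they call an "inference pipeline", which lets you experiment with different transformations and run inference with the trained pipeline.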
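As a possible alternative that stays inside scikit-learn (not part of the answer above), `Pipeline` accepts a `memory` argument that caches fitted transformers on disk through joblib. The following is a minimal sketch under that assumption; the toy data, the temporary cache directory, and the name `cached_pipeline` are made up for illustration.

```python
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only.
docs = ["good essay", "bad essay", "good good writing", "bad bad spelling"]
labels = [1, 0, 1, 0]

# Throwaway cache location; a real project would use a persistent path.
cache_dir = tempfile.mkdtemp()

cached_pipeline = Pipeline(
    [
        ("counts", CountVectorizer()),
        ("tf_idf", TfidfTransformer()),
        ("classifier", MultinomialNB()),
    ],
    memory=cache_dir,  # fitted transformers are cached here by joblib
)

# First fit: both transformers are fitted and written to the cache.
cached_pipeline.fit(docs, labels)

# Change only the classifier. On the next fit the transformers have the
# same parameters and input data, so they can be loaded from the cache;
# the final estimator is never cached and is always refitted.
cached_pipeline.set_params(classifier__alpha=0.5)
cached_pipeline.fit(docs, labels)

predictions = cached_pipeline.predict(["good writing"])
```

Caching works per step of the pipeline that owns the `memory` argument. In the question's layout the whole FeatureUnion is a single step of the outer pipeline, so reusing individual branches such as essay_length would probably require wrapping each branch in its own Pipeline with `memory` set.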

This question about saving part of a sklearn pipeline comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/31905686/
