python - Creating a custom cross-validation in Spark ML

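I'm new to Spark, PySpark DataFrames, and machine learning. How can I create a custom cross-validation for the ML library? For example, I want to change how the training folds are formed, e.g. use stratified splits.

This is my current code: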

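from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

numFolds = 10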
predictions = []

lr = LogisticRegression()\
     .setFeaturesCol("features")\
     .setLabelCol('label')

# Grid search on LR model
lrparamGrid = ParamGridBuilder()\
     .addGrid(lr.regParam, [0.01, 0.1, 0.5, 1.0, 2.0])\
     .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1.0])\
     .addGrid(lr.maxIter, [5, 10, 20])\
     .build()

pipelineModel = Pipeline(stages=[lr])
evaluator = BinaryClassificationEvaluator()

cv = CrossValidator()\
     .setEstimator(pipelineModel)\
     .setEvaluator(evaluator)\
     .setEstimatorParamMaps(lrparamGrid).setNumFolds(5)

# My own cross-validation with stratified splits.
# indexOfStratifiedSplits[i] holds (training IDs, testing IDs) for fold i;
# it and df are built elsewhere.
for i in range(numFolds):
     trainingData = df[df.ID.isin(indexOfStratifiedSplits[i][0])]
     testingData = df[df.ID.isin(indexOfStratifiedSplits[i][1])]

     # Training and grid search on this fold's training data
     cvModel = cv.fit(trainingData)
     predictions.append(cvModel.transform(testingData))

I would like a cross-validation class that can be called like this:

# Desired API (MyCrossValidator does not exist yet); use either option.
cv = (MyCrossValidator()
     .setEstimator(pipelineModel)
     .setEvaluator(evaluator)
     .setEstimatorParamMaps(lrparamGrid)
     .setNumFolds(5)
     # Option 1
     .setSplitIndexes(indexOfStratifiedSplits)
     # Option 2
     .setSplitType("Stratified", ColumnName))

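I don't know whether the best option is to create a class that extends CrossValidator.fit, or to use the approach described under "Passing Functions to Spark". As a newbie, both options are challenging for me. I tried copying the code from GitHub, but I got tons of errors, especially since I don't speak Scala, even though this pipeline is faster in the Scala API.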

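Although I have my own functions to split the data the way I want (based on sklearn), I want to use the pipeline, grid search, and CV together, so that all the permutations are distributed rather than executed on the master. The loop with "my own cross-validation" uses only part of the cluster nodes, because the loop runs on the master/driver.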
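One way to hand precomputed splits to a cross-validator without looping in the driver is to express them as a single fold number per row instead of as lists of IDs. The plain-Python sketch below shows only that conversion. It assumes `indexOfStratifiedSplits` is a list of (train IDs, test IDs) pairs, as in the loop above; the helper name `fold_numbers_from_splits` is hypothetical.

```python
def fold_numbers_from_splits(splits):
    """Map each row ID to the index of the fold in which it is held out.

    `splits` is a list of (train_ids, test_ids) pairs, one per fold
    (assumed to have the same shape as indexOfStratifiedSplits).
    """
    fold_of = {}
    for fold, (_train_ids, test_ids) in enumerate(splits):
        for row_id in test_ids:
            if row_id in fold_of:
                # A valid k-fold split holds every row out exactly once.
                raise ValueError(f"ID {row_id!r} is held out in more than one fold")
            fold_of[row_id] = fold
    return fold_of


if __name__ == "__main__":
    # Toy stand-in for indexOfStratifiedSplits with 2 folds.
    toy_splits = [([3, 4, 5], [0, 1, 2]), ([0, 1, 2], [3, 4, 5])]
    print(fold_numbers_from_splits(toy_splits))
```

In PySpark 3.1 and later, `CrossValidator` has a `foldCol` param that names an integer column with values in [0, numFolds). Joining a mapping like this onto `df` by ID and setting `foldCol` should let the built-in `CrossValidator` use the custom folds, though that step is not shown here and depends on the Spark version in use.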

Either the Python or the Scala API is fine, but preferably Scala.

Thanks

Best answer

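In Python, scikit-learn provides the sklearn.cross_validation.StratifiedKFold class (in current scikit-learn releases it lives in sklearn.model_selection). You can use Sparkit-learn, which aims to provide scikit-learn functionality and API on PySpark.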
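To make the idea of a stratified split concrete, here is a minimal standard-library sketch that builds (train IDs, test IDs) pairs in the shape the question's `indexOfStratifiedSplits` appears to have. It illustrates the technique only; it is not scikit-learn's implementation, and the function name `stratified_kfold_ids` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold_ids(ids, labels, num_folds, seed=0):
    """Return a list of (train_ids, test_ids) pairs, one per fold.

    Rows of each label are shuffled and dealt round-robin into the folds,
    so every fold receives a near-equal share of every class.
    """
    rng = random.Random(seed)

    # Group row IDs by class label.
    by_label = defaultdict(list)
    for row_id, label in zip(ids, labels):
        by_label[label].append(row_id)

    # Deal each class into the folds like a deck of cards.
    folds = [[] for _ in range(num_folds)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for position, row_id in enumerate(members):
            folds[position % num_folds].append(row_id)

    # Fold i is the test set; all other folds together form the training set.
    splits = []
    for i in range(num_folds):
        test_ids = sorted(folds[i])
        train_ids = sorted(
            row_id for j, fold in enumerate(folds) if j != i for row_id in fold
        )
        splits.append((train_ids, test_ids))
    return splits


if __name__ == "__main__":
    ids = list(range(20))
    labels = [1] * 5 + [0] * 15  # 25% positives
    for train_ids, test_ids in stratified_kfold_ids(ids, labels, num_folds=5):
        print(len(train_ids), len(test_ids))
```

Because every class is dealt starting from fold 0, fold sizes can differ by a few rows when class counts are not multiples of the number of folds. For real work, scikit-learn's StratifiedKFold is the safer choice.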

Regarding "python - Creating a custom cross-validation in Spark ML", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33511529/
