scala - How to use CrossValidator to choose between different models

Tags: scala apache-spark apache-spark-mllib cross-validation

I know I can use CrossValidator to tune a single model. But what is the recommended way to evaluate different models against each other? For example, suppose I want to compare a LogisticRegression classifier with a LinearSVC classifier using CrossValidator.

Best answer

After getting more familiar with the API, I solved this by implementing a custom Estimator that wraps two or more estimators it can delegate to, with the selected estimator controlled by a single Param[Int]. Here is the actual code:

import org.apache.spark.ml.Estimator
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.param.Params
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types.StructType

// Shared param that selects which delegate estimator is active (by index).
trait DelegatingEstimatorModelParams extends Params {
  final val selectedEstimator = new Param[Int](this, "selectedEstimator", "The selected estimator")
}

// Estimator that wraps two or more estimators and delegates fit() to the one
// selected via the selectedEstimator param.
class DelegatingEstimator private (override val uid: String, delegates: Array[Estimator[_]]) extends Estimator[DelegatingEstimatorModel] with DelegatingEstimatorModelParams {
  private def this(estimators: Array[Estimator[_]]) = this(Identifiable.randomUID("delegating-estimator"), estimators)

  def this(estimator1: Estimator[_], estimator2: Estimator[_], estimators: Estimator[_]*) = {
    this((Seq(estimator1, estimator2) ++ estimators).toArray)
  }

  setDefault(selectedEstimator -> 0)

  override def fit(dataset: Dataset[_]): DelegatingEstimatorModel = {
    val estimator = delegates(getOrDefault(selectedEstimator))
    val model = estimator.fit(dataset).asInstanceOf[Model[_]]
    new DelegatingEstimatorModel(uid, model)
  }

  override def copy(extra: ParamMap): Estimator[DelegatingEstimatorModel] = {
    val that = new DelegatingEstimator(uid, delegates)
    copyValues(that, extra)
  }

  override def transformSchema(schema: StructType): StructType = {
    // All delegates are assumed to perform the same schema transformation,
    // so we can simply select the first one:
    delegates(0).transformSchema(schema)
  }
}

// Model produced by DelegatingEstimator; it simply forwards transform() to the fitted delegate.
class DelegatingEstimatorModel(override val uid: String, val delegate: Model[_]) extends Model[DelegatingEstimatorModel] with DelegatingEstimatorModelParams {
  def copy(extra: ParamMap): DelegatingEstimatorModel = new DelegatingEstimatorModel(uid, delegate.copy(extra).asInstanceOf[Model[_]])

  def transform(dataset: Dataset[_]): DataFrame = delegate.transform(dataset)

  def transformSchema(schema: StructType): StructType = delegate.transformSchema(schema)
}

To evaluate LogisticRegression against LinearSVC, the class above can be used like this:

import org.apache.spark.ml.classification.{LinearSVC, LogisticRegression}
import org.apache.spark.ml.tuning.ParamGridBuilder

val logRegression = new LogisticRegression()
  .setFeaturesCol(columnNames.features)
  .setPredictionCol(columnNames.prediction)
  .setRawPredictionCol(columnNames.rawPrediciton)
  .setLabelCol(columnNames.label)

val svmEstimator = new LinearSVC()
  .setFeaturesCol(columnNames.features)
  .setPredictionCol(columnNames.prediction)
  .setRawPredictionCol(columnNames.rawPrediciton)
  .setLabelCol(columnNames.label)

val delegatingEstimator = new DelegatingEstimator(logRegression, svmEstimator)

val paramGrid = new ParamGridBuilder()
  .addGrid(delegatingEstimator.selectedEstimator, Array(0, 1))
  .build()
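
The snippet uses a crossValidator whose construction is not shown; a minimal sketch of that setup follows (the BinaryClassificationEvaluator and the fold count are assumptions, adjust them to your metric and data):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Assumed configuration: score each delegate by areaUnderROC (the evaluator's default metric) over 3 folds.
val crossValidator = new CrossValidator()
  .setEstimator(delegatingEstimator)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setRawPredictionCol(columnNames.rawPrediciton)
    .setLabelCol(columnNames.label))
  .setNumFolds(3)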

val model = crossValidator.fit(data)

val bestModel = model.bestModel.asInstanceOf[DelegatingEstimatorModel].delegate
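
A possible follow-up (not part of the original answer): since delegate is an ordinary fitted Spark model, you can pattern-match on it to see which estimator won and then apply it to held-out data (testData is a placeholder name):

import org.apache.spark.ml.classification.{LinearSVCModel, LogisticRegressionModel}

// Inspect which delegate the cross-validation selected.
bestModel match {
  case lr: LogisticRegressionModel => println(s"LogisticRegression won: ${lr.coefficients}")
  case svc: LinearSVCModel         => println(s"LinearSVC won: ${svc.coefficients}")
  case other                       => println(s"Unexpected model type: ${other.getClass}")
}

// Use the winning model for predictions on new data.
val predictions = bestModel.transform(testData)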

A similar question on Stack Overflow: https://stackoverflow.com/questions/48971317/
