scala - 在 1.6 中工作的 Spark ml 管道在 2.0 中不起作用。类型不匹配错误

标签 scala apache-spark machine-learning pipeline

所有, 我有以下适用于 Spark 1.6 的代码。

import org.apache.spark.ml.feature.{ChiSqSelectorModel,QuantileDiscretizer,VectorAssembler,ChiSqSelector}
import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.ml.Pipeline
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import scala.util.Random
import scala.math

val nRows = 1000
val nCols = 100
val rD = sc.parallelize(0 to nRows-1,172).map { _ => Row.fromSeq(Seq.fill(nCols)(Random.nextInt(10))) }

val schema = StructType((0 to nCols-1).map { i => StructField("C" + i, IntegerType, true) } )
val df = spark.createDataFrame(rD, schema)

val continuous = df.drop("C0").dtypes.filter (_._2 != "StringType") map (_._1)
val discretizers = continuous .map(c => new QuantileDiscretizer().setInputCol(c).setOutputCol(s"${c}_disc").setNumBuckets(10))

val conDisc = continuous.map(c => s"${c}_disc")
val assembler = new  VectorAssembler().setInputCols(conDisc).setOutputCol("features")

val selector = new ChiSqSelector().setNumTopFeatures(100).setFeaturesCol("features").setLabelCol("C0").setOutputCol("selectedFeatures")
val pipeline = new Pipeline().setStages(Array.concat(discretizers.toArray, Array(assembler, selector)))
val model = pipeline.fit(df)

如何将其转换为在 2.0 中工作? 问题似乎出在离散器上。 Spark 抛出以下错误

 <console>:56: error: type mismatch;
 found   : Array[org.apache.spark.ml.feature.QuantileDiscretizer]
 required: Array[org.apache.spark.ml.PipelineStage with org.apache.spark.ml.para                                                                                                                         m.shared.HasOutputCol with org.apache.spark.ml.util.DefaultParamsWritable{def co                                                                                                                       py(extra: org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage                                                                                                                      with org.apache.spark.ml.param.shared.HasOutputCol with org.apache.spark.ml.uti                                                                                                                     l.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.ParamMap): org                                                                                                                     .apache.spark.ml.PipelineStage with org.apache.spark.ml.param.shared.HasOutputCo                                                                                                                     l with org.apache.spark.ml.util.DefaultParamsWritable}}]
 Note: org.apache.spark.ml.feature.QuantileDiscretizer <: org.apache.spark.ml.Pip                                                                                                                     elineStage with org.apache.spark.ml.param.shared.HasOutputCol with org.apache.sp                                                                                                                     ark.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.Para                                                                                                                     mMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.param.shared.H                                                                                                                     asOutputCol with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra:                                                                                                                      org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org.                                                                                                                     apache.spark.ml.param.shared.HasOutputCol with org.apache.spark.ml.util.DefaultP                                                                                                                     aramsWritable}}, but class Array is invariant in type T.    
 You may wish to investigate a wildcard type such as `_ <: org.apache.spark.ml.Pi                                                                                                                     pelineStage with org.apache.spark.ml.param.shared.HasOutputCol with org.apache.s                                                                                                                     park.ml.util.DefaultParamsWritable{def copy(extra: org.apache.spark.ml.param.Par                                                                                                                     amMap): org.apache.spark.ml.PipelineStage with org.apache.spark.ml.param.shared.                                                                                                                     HasOutputCol with org.apache.spark.ml.util.DefaultParamsWritable{def copy(extra:                                                                                                                      org.apache.spark.ml.param.ParamMap): org.apache.spark.ml.PipelineStage with org                                                                                                                     .apache.spark.ml.param.shared.HasOutputCol with org.apache.spark.ml.util.Default                                                                                                                     ParamsWritable}}`. (SLS 3.2.10)
   val pipeline = new Pipeline().setStages(Array.concat(discretizers.toArray                                                                                                                     , Array(assembler, selector)))

感谢您的帮助。

最佳答案

希望这对您有用:

val pipeline = new Pipeline().setStages(discretizers ++ Array(assembler, selector))

关于scala - 在 1.6 中工作的 Spark ml 管道在 2.0 中不起作用。类型不匹配错误,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43009257/

相关文章:

scala - spark中列表值的计数 - 数据帧

apache-spark - spark执行程序中的自定义log4j附加程序

python - 在为 Keras NN 做准备时应用 StandardScaler() 时遇到问题

xml - 使用拉式解析器的 Scala 内存泄漏

regex - 在Scala中的两个字符串之间提取字符串

json - 使用 Play Json 和 Salat 格式化可为空的 Seq 或对象列表

scala - 在 Spark 中将 Dataframe 转换为 Map(Key-Value)

java - Java Spark如何将JavaPairRDD <HashSet <String>,HashMap <String,Double >>保存到文件?

python-2.7 - 带 20 个新闻组的随机森林问题

python - 如何使用预训练的神经网络处理灰度图像?