apache-spark - Why doesn't StandardScaler attach metadata to the output column?

Tags: apache-spark apache-spark-mllib apache-spark-ml

I noticed that the ml StandardScaler does not attach metadata to the output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assembleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")

val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assembleFeatures, scalerModel))
  .fit(df)
  .fit(df)

val dft = plm.transform(df)

dft.schema("scaledFeatures").metadata

This gives:
res1: org.apache.spark.sql.types.Metadata = {}
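For contrast, the assembler's output column does carry metadata at this point; a quick check against the same dft:

dft.schema("featuresRaw").metadata
// non-empty: numeric attributes for v0..v6 plus a nominal attribute for v7_IDX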

This example works with this dataset (just adjust the path in the code above).

Is there a specific reason for this? Is it likely that this feature will be added to Spark in the future? Any suggestions for a workaround that doesn't involve duplicating StandardScaler?

Best answer

While discarding the metadata is perhaps not the most fortunate choice, scaling indexed categorical features doesn't make any sense. The values returned by StringIndexer are just labels.
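You can see what would be lost by inspecting the indexed column itself; a small check, reusing dft from the question:

dft.schema("v7_IDX").metadata
// a nominal attribute listing the original string labels; a scaled value
// such as 0.37 could no longer be mapped back to any of them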

If you want to scale numerical features, it should be a separate stage:

val numericAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
  .setOutputCol("numericFeatures")

val scaler = new StandardScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

val finalAssembler: VectorAssembler = new VectorAssembler() 
  .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
  .setOutputCol("features")

val model = new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)
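Applying the fitted pipeline restores metadata on the final column; a quick check (the attribute names in the comment reflect how VectorAssembler names the elements of a vector input that has no metadata of its own):

model.transform(df).schema("features").metadata
// non-empty: generic numeric attributes (scaledNumericFeatures_0, ...) for
// the scaled elements, plus the nominal attribute for v7_IDX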

Keeping in mind the caveat raised at the beginning of this answer, you can also try to copy the metadata:

import spark.implicits._  // for the $"..." column syntax

val result = plm.transform(df).transform(df =>
  df.withColumn(
    "scaledFeatures",
    $"scaledFeatures".as(
      "scaledFeatures",
      df.schema("featuresRaw").metadata)))

result.schema("scaledFeatures").metadata

{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}

Regarding apache-spark - Why doesn't StandardScaler attach metadata to the output column?, see the original question on Stack Overflow: https://stackoverflow.com/questions/50701849/
