Scala - MaxBins error - Decision tree - Categorical variables

My error is similar to the ones in these two posts (CLOUDERA and STACK OVERFLOW). I tried the suggestions there but still see the error below:

   var categoricalFeaturesInfo = Map[Int, Int]()
   categoricalFeaturesInfo += (0 -> 31)
   categoricalFeaturesInfo += (1 -> 7)

java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 3) to be at least as large as the number of values in each categorical feature, but categorical feature 0 has 31 values. Considering remove this and other categorical features with a large number of values, or add more training examples.

   val numClasses = 2
   val impurity = "gini"
   val maxDepth = 9
   val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

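Question: the largest categorical variable has 31 values, and I tried maxBins = 32 (per the answers in those posts). Am I missing something?

By trial and error I tried a whole range of values, such as 2, 3, 10, 15, 50, and 10000, and I see the same error every time.

The map function used: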

val mlprep = flightsRDD.map(flight => {
  val monthday = flight.dofM.toInt - 1 // category
  val weekday = flight.dofW.toInt - 1 // category
})

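Best answer

I ran into the same error using PySpark. There are several possible causes:

1) To be sure maxBins is large enough, set it to the maximum number of distinct values across the categorical columns.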

maxBins = max(categoricalFeaturesInfo.values())
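
The `(= 3)` in the error message is worth a closer look, since the question passes maxBins = 32. As far as I know, Spark MLlib caps the effective bin count at the number of training rows, which would also explain the "add more training examples" hint in the message. The plain-Python sketch below (no Spark needed) mimics that check. `effective_max_bins` and `check_max_bins` are hypothetical helper names, and the capping rule is an assumption based on my reading of the behavior, not something stated in the question.

```python
def effective_max_bins(max_bins, num_examples):
    # Assumption: the requested maxBins is capped by the number of training rows.
    return min(max_bins, num_examples)


def check_max_bins(categorical_features_info, max_bins, num_examples):
    """Return the (feature index, number of values) pairs that exceed the limit."""
    limit = effective_max_bins(max_bins, num_examples)
    return [
        (feature, arity)
        for feature, arity in sorted(categorical_features_info.items())
        if arity > limit
    ]


categorical_features_info = {0: 31, 1: 7}  # values taken from the question
max_bins = max(32, max(categorical_features_info.values()))

# With only 3 training rows, both categorical features are too large.
print(check_max_bins(categorical_features_info, max_bins, num_examples=3))
# With plenty of rows, maxBins = 32 is enough.
print(check_max_bins(categorical_features_info, max_bins, num_examples=1000))
```

If this is the cause, no maxBins value will help; checking trainingData.count() should show whether the RDD really holds only a handful of rows.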

2) The error message says

...but categorical feature 0 has 31 values...

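Is column 0 of trainingData (the first column, not the first feature) actually the label of the training set? It must be! By default, DecisionTree.trainClassifier treats the first column as the label. Make sure the label column is the first column of trainingData and not one of the features.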

3) How did you obtain trainingData? DecisionTree.trainClassifier works for me on a table parsed into LabeledPoint objects, just as RandomForest.trainClassifier does; see http://jarrettmeyer.com/2017/05/04/random-forests-with-pyspark . (*)

4) Also, before converting the dataset into a LabeledPoint RDD, first transform the original DataFrame so that the categorical columns are indexed.

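What worked for me was to first transform the source DataFrame with a Pipeline in which each stage is a StringIndexer that appends a column holding the indexed values of one categorical column, and then to convert the result to LabeledPoint.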
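
To make the indexing step concrete, here is a plain-Python sketch of what a StringIndexer stage does to one column under its default ordering, where the most frequent value gets index 0. It is only an emulation for illustration, not Spark code: `index_column` is a hypothetical helper, the sample values are made up, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def index_column(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (sketch assumption).
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {label: float(i) for i, label in enumerate(labels)}
    return [mapping[v] for v in values], labels


# Made-up categorical column.
indexed, labels = index_column(["AA", "DL", "AA", "UA", "DL", "AA"])
print(indexed)  # [0.0, 1.0, 0.0, 2.0, 1.0, 0.0]
print(labels)   # ['AA', 'DL', 'UA']
```

The length of `labels` is the number of distinct values for that column, which is the number categoricalFeaturesInfo needs.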

In summary, this is how it works for me in PySpark:

Assume the original DataFrame is stored in the variable df and the names of its categorical features are stored in a list called categoricalFeatures.

Import Pipeline and StringIndexer (*):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

To build the pipeline stages, create a list of StringIndexer instances, one per categorical column (*). See https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer

# Use a distinct outputCol: StringIndexer fails if the output column already exists.
indexers = [ StringIndexer(inputCol=column, outputCol=column + "_indexed") for column in categoricalFeatures ]

Be careful here: Spark 1.6 does not implement the handleInvalid="keep" option for StringIndexer, so you need to replace NULL values before running this stage. See https://weishungchung.com/2017/08/14/stringindexer-transform-fails-when-column-contains-nulls/

Set up the pipeline: (*)

pipeline = Pipeline( stages=indexers )

Now run the transformation:

df_r = pipeline.fit(df).transform(df)

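If something goes wrong here, try changing the indexers' outputCol values so that each one differs from its inputCol. If df contains NULL values, a NullPointerException is raised.

Now all the (categorical) columns listed in categoricalFeatures are indexed in df_r. If an indexer's outputCol differs from its inputCol, drop the original column (the one named by inputCol) from df_r.

Finally, declare your trainingData using LabeledPoint: (*)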

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

trainingData = df_r.rdd.map(lambda row: LabeledPoint(row[0], Vectors.dense(row[1:])))

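Here all columns of df_r must be numeric (so the categorical columns have already been converted to indexed columns), and the label column must be column 0 of df_r. If it is not, say it is column i, change the mapping accordingly: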

trainingData = df_r.rdd.map(lambda row: LabeledPoint(row[i], Vectors.dense(row[:i]+row[i+1:])))
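
The slicing in that expression can be checked without Spark, because a Row behaves like a tuple. A minimal sketch with plain tuples follows; `split_label` is a hypothetical helper and the sample row is made up.

```python
def split_label(row, i):
    """Return (label, features), taking the label from position i of the row."""
    return row[i], row[:i] + row[i + 1:]


row = (4.0, 2.0, 1.0, 0.0)  # made-up row whose label sits in column 2
label, features = split_label(row, 2)
print(label, features)  # 1.0 (4.0, 2.0, 0.0)
```

Note that removing the label shifts every later column one position to the left, so the keys of categoricalFeaturesInfo must refer to positions in the feature vector, not to columns of df_r.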

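Creating trainingData this way worked for me.

There is also a quick and easy way to get categoricalFeaturesInfo from the df_r metadata: let k be the index of a categorical column transformed by StringIndexer; then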

df_r.schema.fields[k].metadata['ml_attr']['vals']

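stores the original values. Count them to find out how many distinct values that column has; you can also recover the original values from there instead of using IndexToString.

Regards.

(*) With a few small changes you can do the same in Scala.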
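
Building on that, categoricalFeaturesInfo can be assembled by counting those values per column. The sketch below uses plain dictionaries standing in for `df_r.schema.fields[k].metadata`, so it runs without Spark; `build_categorical_info`, `label_index`, and the sample metadata are hypothetical. It also assumes the label column is removed from the feature vector, which shifts the feature indices of the columns after it.

```python
def build_categorical_info(fields_metadata, label_index=0):
    """Map feature index -> number of distinct values, from per-column metadata."""
    info = {}
    for k, metadata in enumerate(fields_metadata):
        if k == label_index:
            continue  # the label is not a feature
        vals = metadata.get("ml_attr", {}).get("vals")
        if vals is None:
            continue  # no indexed values recorded, so treat the column as numeric
        # Columns after the label move one position left in the feature vector.
        feature_index = k if k < label_index else k - 1
        info[feature_index] = len(vals)
    return info


# Made-up stand-in for [f.metadata for f in df_r.schema.fields].
fields_metadata = [
    {},                                                        # column 0: label
    {"ml_attr": {"type": "nominal", "vals": ["a", "b", "c"]}},  # indexed column
    {},                                                        # plain numeric column
    {"ml_attr": {"type": "nominal", "vals": ["x", "y"]}},       # indexed column
]

info = build_categorical_info(fields_metadata)
print(info)                # {0: 3, 2: 2}
print(max(info.values()))  # 3, a lower bound for maxBins
```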

For Scala - MaxBins error - Decision tree - Categorical variables, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47393121/
