python - Spark exception: Chi-square test expect factors

Tags: python apache-spark pyspark chi-squared

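I have a dataset with 42 features and 1 label. Before running a decision tree for anomaly detection, I want to apply ChiSqSelector, the chi-square feature selection method from the Spark ML library, but I get this error when applying the selector: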

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 11.

Here is my source code:

from pyspark.ml.feature import ChiSqSelector

# dfa1 is the DataFrame holding the assembled "features" vector and the "label" column
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features", outputCol="features2", labelCol="label")
result = selector.fit(dfa1).transform(dfa1)
result.show()

Best answer

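As the error message says, the vectors in your features column contain more than 10000 distinct values in one position (column 11), and those values look continuous rather than categorical. ChiSq can handle at most 10k categories per feature, and you cannot increase this limit: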

  /**
   * Max number of categories when indexing labels and features
   */
  private[spark] val maxCategories: Int = 10000

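In this case, you can use VectorIndexer with the .setMaxCategories() parameter set below 10k to prepare your data. You can try other ways of preparing the data, but none of them will work as long as a feature in the vector still has more than 10k distinct values.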

For python - Spark exception: Chi-square test expect factors, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58608274/
