apache-spark - Pyspark random forest feature importance mapping after column transformations

Tags: apache-spark pyspark apache-spark-sql apache-spark-mllib

I am trying to plot the feature importances of some tree-based models together with the column names. I am using PySpark.

Since I have both text categorical variables and numeric variables, I had to use a pipeline approach along these lines -

  • Index the string columns with StringIndexer
  • Use a one-hot encoder on all of those indexed columns
  • Use VectorAssembler to create a "features" column containing the feature vector

    Some sample code from the docs for steps 1, 2 and 3 -
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

    categoricalColumns = ["workclass", "education", "marital_status", "occupation",
                          "relationship", "race", "sex", "native_country"]
    stages = []  # stages in our Pipeline
    for categoricalCol in categoricalColumns:
        # Category Indexing with StringIndexer
        stringIndexer = StringIndexer(inputCol=categoricalCol,
                                      outputCol=categoricalCol + "Index")
        # Use OneHotEncoderEstimator to convert categorical variables into binary SparseVectors
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                         outputCols=[categoricalCol + "classVec"])
        # Add stages. These are not run here, but will run all at once later on.
        stages += [stringIndexer, encoder]

    numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
                   "capital_loss", "hours_per_week"]
    assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [assembler]

    # Create a Pipeline.
    pipeline = Pipeline(stages=stages)
    # Run the feature transformations.
    #  - fit() computes feature statistics as needed.
    #  - transform() actually transforms the features.
    pipelineModel = pipeline.fit(dataset)
    dataset = pipelineModel.transform(dataset)
    
  • Finally, train the model (see the sketch below)

    After training and evaluation, I can use "model.featureImportances" to get the feature rankings, but I do not get the feature/column names, only the feature indices, like this -
    print dtModel_1.featureImportances
    
    (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
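    Neither the estimator nor the label column is shown in the question, so the following is only a minimal sketch of what this training step might look like, assuming a RandomForestClassifier and a label column named "label":

    from pyspark.ml.classification import RandomForestClassifier

    # Train on the "features" vector assembled by the pipeline above.
    # "label" is an assumed name for the target column.
    rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
    dtModel_1 = rf.fit(dataset)
    print(dtModel_1.featureImportances)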
    

  • How can I map this back to the original column names and values, so that I can plot it?

    Best Answer

    Extract the metadata as shown here by user6910411:

    from itertools import chain

    # (index, name) pairs taken from the metadata on the "features" column
    attrs = sorted(
        (attr["idx"], attr["name"]) for attr in (chain(*dataset
            .schema["features"]
            .metadata["ml_attr"]["attrs"].values())))
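    For reference, the raw metadata attached by VectorAssembler can also be inspected directly; the grouping keys are real, but the names and indices below are purely illustrative:

    # One-hot encoded columns show up under "binary", the numeric ones under "numeric".
    dataset.schema["features"].metadata["ml_attr"]["attrs"]
    # {'binary':  [{'idx': 0, 'name': 'workclassclassVec_...'}, ...],
    #  'numeric': [{'idx': ..., 'name': 'age'}, ...]}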
    

    and combine with the feature importances:
    [(name, dtModel_1.featureImportances[idx])
     for idx, name in attrs
     if dtModel_1.featureImportances[idx]]
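    Since the end goal is a plot, here is one way to finish - a sketch assuming pandas and matplotlib are available on the driver:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Build a (feature, importance) table from the pairs above and sort it.
    fi = pd.DataFrame(
        [(name, float(dtModel_1.featureImportances[idx])) for idx, name in attrs
         if dtModel_1.featureImportances[idx]],
        columns=["feature", "importance"]
    ).sort_values("importance", ascending=False)

    # Horizontal bar chart of the top features.
    fi.head(20).plot.barh(x="feature", y="importance", legend=False)
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()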
    

    Regarding "apache-spark - Pyspark random forest feature importance mapping after column transformations", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50937591/
