python - What does get_fscore() of an xgboost ML model do?

Tags: python feature-selection xgboost

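Does anyone know how these numbers are calculated? The documentation says this function "gets feature importance of each feature", but it does not explain how to interpret the results.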

Best answer

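It is a metric that simply counts how many times each feature is split on. It is analogous to the Frequency metric in the R version: https://cran.r-project.org/web/packages/xgboost/xgboost.pdf

It is about as basic a feature importance metric as you can get.

In other words: how many times was this variable split on?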
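
To make the interpretation concrete, here is a minimal sketch of how such counts might be read. The `fscore` dictionary is made up for illustration (it only imitates the shape of what get_fscore() returns), and dividing by the total is just one way to get a relative, frequency-style number.

```python
# Hypothetical result in the shape returned by get_fscore():
# feature id -> number of splits on that feature across all trees.
fscore = {"f0": 12, "f1": 5, "f2": 3}

# Rank features from most to least frequently split on.
ranked = sorted(fscore.items(), key=lambda kv: kv[1], reverse=True)

# Turn raw counts into a share of all splits (an illustrative normalisation).
total = sum(fscore.values())
share = {name: count / total for name, count in fscore.items()}

for name, count in ranked:
    print(f"{name}: {count} splits ({share[name]:.0%} of all splits)")
```

A higher count only says the feature was used in more splits; it says nothing about how much each split improved the model.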

The code for this method shows that it simply adds up the occurrences of a given feature across all trees.

[Here: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L953][1]

def get_fscore(self, fmap=''):
    """Get feature importance of each feature.
    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    trees = self.get_dump(fmap)  # dump all the trees to text
    fmap = {}                    # name reused for the result: feature id -> split count
    for tree in trees:                    # loop through the trees
        for line in tree.split('\n'):     # one line of text per node
            arr = line.split('[')
            if len(arr) == 1:             # no '[' on the line, so it is not a split node
                continue
            fid = arr[1].split(']')[0]    # text between the brackets (the split condition)
            fid = fid.split('<')[0]       # keep only the feature name before '<'

            if fid not in fmap:  # first time this feature id is seen
                fmap[fid] = 1    # start its count at 1
            else:
                fmap[fid] += 1   # otherwise increment its count
    return fmap                  # number of times each feature was split on
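
The same counting logic can be tried without a trained model. The sketch below is not the library code: `SAMPLE_DUMPS` is a made-up pair of trees written in the text format that get_dump() produces, and `count_splits` is a hypothetical helper name.

```python
from collections import Counter

# Made-up dumps imitating the text format of Booster.get_dump():
# split nodes look like "id:[feature<threshold] ...", leaves like "id:leaf=value".
SAMPLE_DUMPS = [
    "0:[f0<0.5] yes=1,no=2,missing=1\n"
    "\t1:leaf=0.1\n"
    "\t2:[f1<1.5] yes=3,no=4,missing=3\n"
    "\t\t3:leaf=-0.2\n"
    "\t\t4:leaf=0.3\n",
    "0:[f0<0.7] yes=1,no=2,missing=1\n"
    "\t1:leaf=0.05\n"
    "\t2:leaf=-0.05\n",
]


def count_splits(dumps):
    """Count how many split nodes use each feature (hypothetical helper)."""
    counts = Counter()
    for tree in dumps:
        for line in tree.split("\n"):
            parts = line.split("[")
            if len(parts) == 1:  # leaf or empty line: no split condition
                continue
            condition = parts[1].split("]")[0]  # e.g. "f0<0.5"
            feature = condition.split("<")[0]   # e.g. "f0"
            counts[feature] += 1
    return dict(counts)


print(count_splits(SAMPLE_DUMPS))
```

Here f0 is split on once in each tree and f1 once in the first tree, so the counts come out as 2 and 1.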

A similar question about "python - What does get_fscore() of an xgboost ML model do?" can be found on Stack Overflow: https://stackoverflow.com/questions/33652224/
