python - How do I create a custom evaluation metric for catboost?

Tags: python scikit-learn catboost

Similar questions:

  • Python Catboost: Multiclass F1 score custom metric

  • Catboost tutorial
  • https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function

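  • Question
    I have a binary classification problem. After modeling, we have the model's test predictions y_pred and the true test labels y_true.
    I want a custom evaluation metric defined by the following equation:
    profit = 400 * truePositive - 200*falseNegative - 100*falsePositive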
    
    Also, since higher profit is better, I want to maximize the function rather than minimize it.
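
    As a quick sanity check of the formula, here is a minimal pure-Python sketch that counts the confusion-matrix cells by hand and applies the weights from the equation. The function name `profit_from_labels` and the toy labels are made up for illustration.

```python
def profit_from_labels(y_true, y_pred):
    """Profit = 400*TP - 200*FN - 100*FP for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return 400 * tp - 200 * fn - 100 * fp


# Toy data: 2 true positives, 1 false negative, 1 false positive, 1 true negative
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(profit_from_labels(y_true, y_pred))  # 400*2 - 200*1 - 100*1 = 500
```

    True negatives contribute nothing, so a model that predicts all zeros scores -200 per missed positive.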
    How can I get this eval_metric in catboost?
    Using sklearn
    def get_profit(y_true, y_pred):
        tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true,y_pred).ravel()
        loss = 400*tp - 200*fn - 100*fp
        return loss
    
    scoring = sklearn.metrics.make_scorer(get_profit, greater_is_better=True)
    
    Using catboost
    class ProfitMetric(object):
        def get_final_error(self, error, weight):
            return error / (weight + 1e-38)
    
        def is_max_optimal(self):
            return True
    
        def evaluate(self, approxes, target, weight):
            assert len(approxes) == 1
            assert len(target) == len(approxes[0])
    
            approx = approxes[0]
    
            error_sum = 0.0
            weight_sum = 0.0
    
            ** I don't know here**
    
            return error_sum, weight_sum
    
    Question
    How do I complete the custom evaluation metric in catboost?
    Update
    My update so far
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import sklearn
    
    from catboost import CatBoostClassifier
    from sklearn.model_selection import train_test_split
    
    def get_profit(y_true, y_pred):
        tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true,y_pred).ravel()
        profit = 400*tp - 200*fn - 100*fp
        return profit
    
    
    class ProfitMetric:
        def is_max_optimal(self):
            return True # greater is better
    
        def evaluate(self, approxes, target, weight):
            assert len(approxes) == 1
            assert len(target) == len(approxes[0])
    
            approx = approxes[0]
    
            y_pred = np.rint(approx)
            y_true = np.array(target).astype(int)
    
            output_weight = 1 # weight is not used
    
            score = get_profit(y_true, y_pred)
     
            return score, output_weight
    
        def get_final_error(self, error, weight):
            return error
    
    
    df = sns.load_dataset('titanic')
    X = df[['survived','pclass','age','sibsp','fare']]
    y = X.pop('survived')
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
    
    
    model = CatBoostClassifier(metric_period=50,
      n_estimators=200,
      eval_metric=ProfitMetric()
    )
    
    model.fit(X, y, eval_set=(X_test, y_test)) # this fails
    

    Best answer

    The main difference from your code is:

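    @staticmethod
    def get_profit(y_true, y_pred):
        # expit maps raw log-odds to probabilities; threshold at 0.5 to get 0/1 labels
        y_pred = (expit(y_pred) > 0.5).astype(int)
        y_true = y_true.astype(int)
        #print("ACCURACY:",(y_pred==y_true).mean())
        # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        profit = 400*tp - 200*fn - 100*fp
        return profit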
    
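    It is not obvious from the example you linked, but on inspection catboost internally treats the predictions as raw log-odds (hat tip @Ben). So to use confusion_matrix correctly, you need to make sure that y_true and y_pred are both integer class labels: apply the sigmoid to get probabilities, then threshold them. This is done as follows:
    y_pred = (scipy.special.expit(y_pred) > 0.5).astype(int)
    y_true = y_true.astype(int)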
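
    To make the log-odds point concrete, the following sketch uses only the standard library (a hand-written `sigmoid` standing in for `scipy.special.expit`) and made-up raw scores. It shows why rounding raw scores directly can yield values outside {0, 1}, while applying the sigmoid and then thresholding gives class labels. The 0.5 cutoff and the helper name `raw_to_labels` are assumptions for illustration.

```python
import math


def sigmoid(x):
    """Logistic function; the same mapping scipy.special.expit applies to a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


def raw_to_labels(raw_scores, threshold=0.5):
    """Map raw log-odds to 0/1 labels by thresholding the probability."""
    return [1 if sigmoid(s) > threshold else 0 for s in raw_scores]


raw_scores = [-2.3, -0.4, 0.0, 0.7, 3.1]  # hypothetical raw model outputs

rounded = [round(s) for s in raw_scores]  # rounding the raw scores directly
labels = raw_to_labels(raw_scores)        # sigmoid first, then threshold

print(rounded)  # includes -2 and 3, which are not valid class labels
print(labels)
```

    With more than two distinct values in the predictions, the confusion matrix is no longer 2x2, so unpacking it into tn, fp, fn, tp cannot work.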
    
    So the complete working code is:
    import numpy as np
    import seaborn as sns
    from catboost import CatBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    from scipy.special import expit
    
    df = sns.load_dataset('titanic')
    X = df[['survived','pclass','age','sibsp','fare']]
    y = X.pop('survived')
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
    
    class ProfitMetric:
        
        @staticmethod
        def get_profit(y_true, y_pred):
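            # expit maps raw log-odds to probabilities; threshold at 0.5 to get 0/1 labels
            y_pred = (expit(y_pred) > 0.5).astype(int)
            y_true = y_true.astype(int)
            #print("ACCURACY:",(y_pred==y_true).mean())
            # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
            tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
            profit = 400*tp - 200*fn - 100*fp
            return profit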
        
        def is_max_optimal(self):
            return True # greater is better
    
        def evaluate(self, approxes, target, weight):            
            assert len(approxes) == 1
            assert len(target) == len(approxes[0])
            y_true = np.array(target).astype(int)
            approx = approxes[0]
            score = self.get_profit(y_true, approx)
            return score, 1
    
        def get_final_error(self, error, weight):
            return error
    
    model = CatBoostClassifier(metric_period=50,
      n_estimators=200,
      eval_metric=ProfitMetric()
    )
    
    model.fit(X_train, y_train, eval_set=(X_test, y_test))  # fit on the training split only
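
    The metric object can also be exercised without training a model, which is a handy way to debug it. This sketch reimplements the same three methods in pure Python and calls them by hand on made-up scores. The class name `ProfitMetricPure` and the toy data are hypothetical, and it assumes that testing whether the raw score is positive is equivalent to a 0.5 probability cutoff.

```python
class ProfitMetricPure:
    """Same three-method interface as ProfitMetric, without numpy or sklearn."""

    def is_max_optimal(self):
        return True  # greater is better

    def evaluate(self, approxes, target, weight):
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        profit = 0
        for raw, t in zip(approxes[0], target):
            pred = 1 if raw > 0 else 0  # raw log-odds > 0 <=> probability > 0.5
            t = int(t)
            if t == 1 and pred == 1:
                profit += 400  # true positive
            elif t == 1 and pred == 0:
                profit -= 200  # false negative
            elif t == 0 and pred == 1:
                profit -= 100  # false positive
        return profit, 1  # weight is not used

    def get_final_error(self, error, weight):
        return error


metric = ProfitMetricPure()
approxes = [[2.0, -1.5, 0.3, -0.2]]  # a single row of raw scores
target = [1, 1, 0, 0]
error, w = metric.evaluate(approxes, target, None)
print(metric.get_final_error(error, w))  # 400 - 200 - 100 = 100
```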
    

    Regarding python - How do I create a custom evaluation metric for catboost?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65462220/
