python - scikit-learn: struggling with make_scorer

Tags: python machine-learning scikit-learn data-science

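I have to implement a classification algorithm on a medical dataset, so I think good recall on disease recognition is crucial. I want to implement a scorer like this: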

recall_scorer = make_scorer(recall_score(y_true = , y_pred = , \
labels =['compensated_hypothyroid', 'primary_hypothyroid'], average = 'macro'))

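However, I want to use this scorer in GridSearchCV so that it is evaluated on my KFold splits. I don't know how to initialize the scorer, because it seems to require y_true and y_pred to be passed immediately.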

How can I solve this? Do I have to write my own hyperparameter tuning?

Best answer

Following your comment, computing the recall for only two classes during scikit-learn's cross-validation iterations is doable.

Consider this example dataset:

(image: dataset example)

You can use the make_scorer function to get at this information during cross-validation as follows:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np


def getDataset(path, x_attr, y_attr, mapping):
    """
    Extract dataset from CSV file
    :param path: location of csv file
    :param x_attr: list of Features Names
    :param y_attr: Y header name in CSV file
    :param mapping: dictionary of the classes integers
    :return: tuple, (X, Y)
    """
    df = pd.read_csv(path)
    df.replace(mapping, inplace=True)
    X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
    Y = np.array(df[y_attr])
    return X, Y


def custom_recall_score(y_true, y_pred):
    """
    Workaround for the recall score
    :param y_true: Ground Truth during iterations
    :param y_pred: Y predicted during iterations
    :return: float, recall
    """
    wanted_labels = [0, 1]
    assert set(wanted_labels).issubset(y_true)
    wanted_indices = [y_true.tolist().index(x) for x in wanted_labels]
    wanted_y_true = [y_true[x] for x in wanted_indices]
    wanted_y_pred = [y_pred[x] for x in wanted_indices]
    recall_ = recall_score(wanted_y_true, wanted_y_pred,
                           labels=wanted_labels, average='macro')
    print("Wanted Indices: {}".format(wanted_indices))
    print("Wanted y_true: {}".format(wanted_y_true))
    print("Wanted y_pred: {}".format(wanted_y_pred))
    print("Recall during cross validation: {}".format(recall_))
    return recall_


def run(X_data, Y_data):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_index, test_index = next(sss.split(X_data, Y_data))
    X_train, X_test = X_data[train_index], X_data[test_index]
    Y_train, Y_test = Y_data[train_index], Y_data[test_index]
    param_grid = {'C': [0.1, 1]} # or whatever parameter you want
    # I am using LR just for example
    model = LogisticRegression(solver='saga', random_state=0)
    clf = GridSearchCV(model, param_grid,
                       cv=StratifiedKFold(n_splits=2),
                       return_train_score=True,
                       scoring=make_scorer(custom_recall_score))
    clf.fit(X_train, Y_train)
    print(clf.cv_results_)


X_data, Y_data = getDataset("dataset_example.csv", ['TSH', 'T4'], 'diagnosis',
                            {'compensated_hypothyroid': 0, 'primary_hypothyroid': 1,
                             'hyperthyroid': 2, 'normal': 3})
run(X_data, Y_data)

Example results

Wanted Indices: [3, 5]
Wanted y_true: [0, 1]
Wanted y_pred: [3, 3]
Recall during cross validation: 0.0
...
...
Wanted Indices: [0, 4]
Wanted y_true: [0, 1]
Wanted y_pred: [1, 1]
Recall during cross validation: 0.5
...
...
{'param_C': masked_array(data=[0.1, 1], mask=[False, False],
  fill_value='?', dtype=object), 
  'mean_score_time': array([0.00094521, 0.00086224]), 
  'mean_fit_time': array([0.00298035, 0.0023526 ]), 
  'std_score_time': array([7.02142715e-05, 1.78813934e-06]), 
  'mean_test_score': array([0.21428571, 0.5       ]), 
  'std_test_score': array([0.24743583, 0.        ]), 
  'params': [{'C': 0.1}, {'C': 1}], 
  'mean_train_score': array([0.25, 0.5 ]), 
  'std_train_score': array([0.25, 0.  ]), 
  ....
  ....}
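
The two recall values printed above can be checked by hand. The following is a minimal sketch, not part of the original answer, that recomputes them from the printed y_true and y_pred lists. It also illustrates that `list.index` returns only the first matching position, so the scorer as written evaluates one sample per wanted label in each fold. The array `y_true_fold` is a made-up illustration, not data from the example dataset.

```python
import numpy as np
from sklearn.metrics import recall_score

wanted_labels = [0, 1]

# Values copied from the printed output above.
# Class 0 is missed and class 1 is missed: macro recall = (0 + 0) / 2
first = recall_score([0, 1], [3, 3], labels=wanted_labels, average='macro')
# Class 0 is missed, class 1 is found: macro recall = (0 + 1) / 2
second = recall_score([0, 1], [1, 1], labels=wanted_labels, average='macro')
print(first, second)

# Hypothetical fold labels (an assumption for illustration only).
y_true_fold = np.array([3, 0, 0, 1, 2, 1])
# list.index gives the first occurrence of each wanted label, nothing more.
wanted_indices = [y_true_fold.tolist().index(x) for x in wanted_labels]
print(wanted_indices)
```

Here the second occurrences of classes 0 and 1 (positions 2 and 5) are never scored, which is worth keeping in mind when reading the mean_test_score values.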

Warning

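You must use StratifiedShuffleSplit and StratifiedKFold and have balanced classes in your dataset, so that every split keeps a stratified class distribution; otherwise the assertion above may fail!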
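
As a possible alternative to the custom function, the label selection can be left to recall_score itself: make_scorer takes the metric function (not a call to it) and forwards any extra keyword arguments each time a fold is scored. The sketch below is written under assumptions: synthetic data from make_classification stands in for the thyroid dataset, and the integer labels 0 and 1 play the role of the two hypothyroid classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the medical data (assumption): 4 classes encoded 0..3.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# Pass the function itself; labels and average are forwarded to recall_score
# whenever the scorer is called with a fold's y_true and y_pred.
recall_scorer = make_scorer(recall_score, labels=[0, 1], average='macro')

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.1, 1]},
                      cv=StratifiedKFold(n_splits=2),
                      scoring=recall_scorer)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With this form every sample of the two wanted classes in a fold contributes to the score, and no assertion is involved. A fold that lacks one of the wanted classes would still give an undefined recall for that class, so stratified splits remain a sensible choice.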

A similar question about "python - scikit-learn: struggling with make_scorer" can be found on Stack Overflow: https://stackoverflow.com/questions/53379631/
