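This question was already asked a year ago on StackExchange/Stats, but it was flagged as off-topic and closed without an answer.
So my question is the same: is there a Python implementation (scikit-learn or otherwise) of cost curves, as described in Cost curves: An improved method for visualizing classifier performance? If not, how can I implement it given the ground-truth labels, the predictions, and optionally the misclassification costs?
The method plots performance (the normalized expected cost) against the operating point (the probability cost function, which is based on the probability of the positive class and the misclassification costs).
When the misclassification costs of positive and negative samples are both equal to 1, performance corresponds to the error rate, and the operating point is the probability that a sample belongs to the positive class.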
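For reference, here is a sketch of the two quantities as I understand them from the cost-curve paper. The notation C(-|+) for the cost of a false negative and C(+|-) for the cost of a false positive matches the comments in the answer's code; FNR, FPR and TPR are the usual false-negative, false-positive and true-positive rates of one classifier threshold.

```latex
\begin{align}
PC(+) &= \frac{p(+)\,C(-|+)}{p(+)\,C(-|+) + p(-)\,C(+|-)},
      \qquad p(-) = 1 - p(+) \\
NEC   &= FNR \cdot PC(+) + FPR \cdot \bigl(1 - PC(+)\bigr)
       = (1 - TPR - FPR)\,PC(+) + FPR
\end{align}
% Special case C(-|+) = C(+|-) = 1:
\begin{align}
PC(+) &= p(+) \\
NEC   &= FNR \cdot p(+) + FPR \cdot p(-) \quad \text{(the error rate)}
\end{align}
```

The second form of NEC shows why each point (FPR, TPR) in ROC space becomes a straight line in cost space, with slope 1 - TPR - FPR and intercept FPR.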
Best answer
I looked into this, and I think I have a working implementation.
import numpy as np
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
# %% INPUTS
# C(-|+)
cost_fn = <a scalar value>
# C(+|-)
cost_fp = <a scalar value>
# Ground truth
truth = <a list of 0 (negative class) or 1 (positive class)>
# Predictions from a classifier
score = <a list of [0,1] class probabilities>
# %% OUTPUTS
# 1D-array of x-axis values (normalized PC)
pc = None
# list of lines as (slope, intercept)
lines = []
# lower envelope of the list of lines as a 1D-array of y-axis values (NEC)
lower_envelope = []
# area under the lower envelope (the smaller, the better)
area = None
# %% COMPUTATION
# points from the roc curve, because a point in the ROC space <=> a line in the cost space
roc_fpr, roc_tpr, _ = roc_curve(truth, score)
# sweep p(+) over [0, 1] (stored in `thresholds`) and compute the normalized p(+)*C(-|+), i.e. PC(+)
thresholds = np.arange(0, 1.01, .01)
pc = (thresholds*cost_fn) / (thresholds*cost_fn + (1-thresholds)*cost_fp)
# compute a line in the cost space for each point in the roc space
for fpr, tpr in zip(roc_fpr, roc_tpr):
    slope = (1-tpr-fpr)
    intercept = fpr
    lines.append((slope, intercept))
# compute the lower envelope
for x_value in pc:
    y_value = min([slope*x_value+intercept for slope, intercept in lines])
    lower_envelope.append(max(0, y_value))
lower_envelope = np.array(lower_envelope)
# compute the area under the lower envelope using the composite trapezoidal rule
# (np.trapz is deprecated in NumPy 2.0 in favor of np.trapezoid)
area = np.trapz(lower_envelope, pc)
# %% EXAMPLE OF PLOT
# display each line as a thin dashed line
for slope, intercept in lines:
    plt.plot(pc, slope*pc+intercept, color="grey", lw=1, linestyle="--")
# display the lower envelope as a thicker black line
plt.plot(pc, lower_envelope, color="black", lw=3, label="area={:.3f}".format(area))
# plot parameters
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05*max(lower_envelope)])
plt.xlabel("Probability Cost Function")
plt.ylabel("Normalized Expected Cost")
plt.title("Cost curve")
plt.legend(loc="lower right")
plt.show()
Example of the result using cost_fn = cost_fp = 1, the breast cancer dataset, and the scores of a Gaussian naive Bayes classifier.
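The computation in the answer can also be wrapped in a reusable function. The following is only a sketch under a few assumptions: the name `cost_curve` and the parameter `n_points` are made up for illustration, the ROC points are passed in directly (for example the first two arrays returned by `roc_curve(truth, score)`), and the loops are replaced by NumPy broadcasting with the trapezoidal rule written out by hand.

```python
import numpy as np


def cost_curve(roc_fpr, roc_tpr, cost_fn=1.0, cost_fp=1.0, n_points=101):
    """Return (pc, lower_envelope, area) for a set of ROC points.

    Hypothetical helper: the name and signature are illustrative only.
    cost_fn is C(-|+), cost_fp is C(+|-); both are assumed to be positive.
    """
    roc_fpr = np.asarray(roc_fpr, dtype=float)
    roc_tpr = np.asarray(roc_tpr, dtype=float)

    # Sweep p(+) over [0, 1] and map it to the probability cost function PC(+).
    p_pos = np.linspace(0.0, 1.0, n_points)
    pc = (p_pos * cost_fn) / (p_pos * cost_fn + (1.0 - p_pos) * cost_fp)

    # One line per ROC point: NEC = (1 - TPR - FPR) * PC + FPR.
    slopes = 1.0 - roc_tpr - roc_fpr
    intercepts = roc_fpr

    # Evaluate every line at every PC value; result has shape (n_lines, n_points).
    nec = slopes[:, None] * pc[None, :] + intercepts[:, None]

    # Lower envelope: the cheapest line at each PC value, clipped at 0.
    lower_envelope = np.clip(nec.min(axis=0), 0.0, None)

    # Composite trapezoidal rule, written out to avoid depending on
    # np.trapz / np.trapezoid (the name differs between NumPy versions).
    widths = np.diff(pc)
    heights = (lower_envelope[1:] + lower_envelope[:-1]) / 2.0
    area = float(np.sum(widths * heights))

    return pc, lower_envelope, area
```

As a sanity check, passing only the two trivial ROC points (0, 0) and (1, 1) with equal costs gives the envelope min(PC, 1 - PC), whose area is 0.25, while a perfect classifier that also contains the point (0, 1) gives an envelope of 0 everywhere.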
Source: a similar question on Stack Overflow about how to plot cost curves using Python: https://stackoverflow.com/questions/56366425/