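I am using caret's twoClassSummary function to find the model hyperparameters that maximise specificity. But how does the function determine the probability threshold that maximises specificity?
For each model hyperparameter/fold, does caret essentially evaluate every threshold between 0 and 1 and return the maximum specificity? In the example below you can see that the model settled on cp = 0.01492537.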
# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv",
                        number=5,
                        classProbs=TRUE,
                        summaryFunction=twoClassSummary)
set.seed(7)
fit <- train(diabetes~.,
             data=PimaIndiansDiabetes,
             method="rpart",
             tuneLength=5,
             metric="Spec",
             trControl=control)
print(fit)
CART
768 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 614, 614, 615, 615, 614
Resampling results across tuning parameters:
cp ROC Sens Spec
0.01305970 0.7615943 0.824 0.5937806
0.01492537 0.7712055 0.824 0.6016073
0.01741294 0.7544469 0.830 0.5976939
0.10447761 0.6915783 0.866 0.5035639
0.24253731 0.6437820 0.884 0.4035639
Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01492537.
Best answer
No, twoClassSummary does not evaluate every threshold between 0 and 1. It only returns the values at the default threshold of 0.5. twoClassSummary is defined as:
function (data, lev = NULL, model = NULL)
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2)
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls))
        stop("levels of observed and predicted data do not match")
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0,
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"],
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
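To make the fixed threshold concrete, here is a minimal base-R sketch of the calculation. The data frame `toy` and its values are made up for illustration; only the column layout (obs, pred, and one probability column per class) follows what the function above reads. Sens and Spec are computed from the stored pred column alone, so no threshold is searched:

```r
# Hypothetical toy data in the layout a summaryFunction receives
lev <- c("neg", "pos")
toy <- data.frame(
  obs = factor(c("neg", "neg", "neg", "pos", "pos", "pos"), levels = lev),
  neg = c(0.9, 0.7, 0.4, 0.6, 0.3, 0.1)
)
toy$pos <- 1 - toy$neg

# Class predictions at the default cut-off: the class with the larger probability
toy$pred <- factor(ifelse(toy$neg > 0.5, "neg", "pos"), levels = lev)

# As in the definition above, lev[1] is the event class and lev[2] the negative class
sens <- sum(toy$pred == lev[1] & toy$obs == lev[1]) / sum(toy$obs == lev[1])
spec <- sum(toy$pred == lev[2] & toy$obs == lev[2]) / sum(toy$obs == lev[2])

print(c(Sens = sens, Spec = spec))
```

The probability columns are only used for the ROC value; the two class-based metrics never look at them.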
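To verify this, try the following example with a custom summaryFunction in which I explicitly set the threshold to 0.5. You will see that the two values Spec (the original specificity as reported by twoClassSummary) and Spec2 (the specificity with the threshold manually set to 0.5) are identical:
# load libraries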
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  out <- c(spec, spec2)
  names(out) <- c("Spec", "Spec2")
  out
}
# prepare resampling method
control <- trainControl(method="cv",
                        number=5,
                        classProbs=TRUE,
                        summaryFunction=customSummary)
set.seed(7)
fit <- train(diabetes~.,
             data=PimaIndiansDiabetes,
             method="rpart",
             tuneLength=5,
             metric="Spec",
             trControl=control)
print(fit)
CART
768 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 615, 615, 614, 614, 614
Resampling results across tuning parameters:
cp Spec Spec2
0.01305970 0.5749825 0.5749825
0.01492537 0.5411600 0.5411600
0.01741294 0.5596785 0.5596785
0.10447761 0.4932215 0.4932215
0.24253731 0.2837177 0.2837177
Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0130597.
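Furthermore, if you want caret to compute, for each hyperparameter setting, the maximum specificity over a range of thresholds and report that value, you can define a custom summaryFunction like the one below, which tries every threshold from 0.1 to 0.95 in steps of 0.05: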
# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # specificity from the class predictions caret supplies
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # specificity with the threshold set to 0.5 by hand
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  # specificity at each threshold from 0.1 to 0.95 in steps of 0.05
  speclist <- numeric(0)
  for(i in seq(0.1, 0.95, 0.05)){
    predi <- factor(ifelse(data[, "neg"] > i, "neg", "pos"))
    singlespec <- specificity(predi, data[, "obs"], "pos")
    speclist <- c(speclist, singlespec)
  }
  # keep the largest specificity over all thresholds
  specmax <- max(speclist)
  out <- c(spec, spec2, specmax)
  names(out) <- c("Spec", "Spec2", "Specmax")
  out
}
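One thing worth keeping in mind with a threshold sweep like this: "pos" is the negative class here, and an observation is predicted "pos" whenever its "neg" probability is at or below the threshold, so specificity can only stay the same or rise as the threshold goes up. The maximum will therefore usually sit at the high end of the grid, at the cost of sensitivity. The base-R sketch below illustrates this on made-up probabilities (the vectors `obs` and `prob_neg` and the helpers `spec_at` and `sens_at` are hypothetical, not part of caret) and also records which threshold produced the maximum:

```r
# Hypothetical observed classes and predicted "neg" probabilities
lev <- c("neg", "pos")
obs <- factor(c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"),
              levels = lev)
prob_neg <- c(0.97, 0.82, 0.61, 0.33, 0.72, 0.47, 0.22, 0.03)

# Specificity: share of true "pos" cases predicted "pos" at a given threshold
spec_at <- function(threshold) {
  sum(prob_neg <= threshold & obs == "pos") / sum(obs == "pos")
}

# Sensitivity: share of true "neg" cases predicted "neg" at a given threshold
sens_at <- function(threshold) {
  sum(prob_neg > threshold & obs == "neg") / sum(obs == "neg")
}

# Same grid as in the custom summaryFunction above
thresholds <- seq(0.1, 0.95, by = 0.05)
results <- data.frame(threshold = thresholds,
                      Spec = sapply(thresholds, spec_at),
                      Sens = sapply(thresholds, sens_at))

# First threshold that reaches the maximum specificity
best <- results[which.max(results$Spec), ]
print(results)
print(best)
```

Reporting the sensitivity next to the maximum specificity, or selecting on a metric that balances the two, may be more informative than Specmax on its own.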
Source: the Stack Overflow question "r - How does caret train determine the probability threshold to maximise specificity": https://stackoverflow.com/questions/45749037/