r - How does caret's train determine the probability threshold to maximise specificity?

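I am using caret's twoClassSummary function to find the model hyperparameters that maximise specificity. However, how does the function determine the probability threshold that maximises specificity?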

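Essentially, for each model hyperparameter/fold, does caret evaluate every threshold between 0 and 1 and return the maximum specificity? In the example below you can see that the model has landed on cp = 0.01492537.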

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=twoClassSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)


CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results across tuning parameters:

  cp          ROC        Sens   Spec     
  0.01305970  0.7615943  0.824  0.5937806
  0.01492537  0.7712055  0.824  0.6016073
  0.01741294  0.7544469  0.830  0.5976939
  0.10447761  0.6915783  0.866  0.5035639
  0.24253731  0.6437820  0.884  0.4035639

Spec was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.01492537.

Best answer

No, twoClassSummary does not evaluate every threshold between 0 and 1. It only returns the values for the standard threshold of 0.5.
twoClassSummary is defined as:

 function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
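The definition above computes Sens and Spec from the `pred` column, i.e. from hard class labels, and never re-thresholds the probability columns. As a minimal base-R sketch of what that means, the toy data frame below mimics the shape of the `data` argument (observed class, predicted class, one probability column per class). The values, the name `toy`, and the helper `spec_from_labels` are made up for illustration and are not part of caret; the sketch also assumes the class labels correspond to a 0.5 cutoff on P(neg).

```r
# Toy data in the shape a summaryFunction receives (values are invented)
lev <- c("neg", "pos")
toy <- data.frame(
  obs = factor(c("neg", "neg", "neg", "pos", "pos", "pos"), levels = lev),
  neg = c(0.90, 0.60, 0.30, 0.70, 0.40, 0.10)
)
toy$pos <- 1 - toy$neg

# Hard labels as a 0.5 cutoff on P(neg) would produce them (assumption)
toy$pred <- factor(ifelse(toy$neg > 0.5, "neg", "pos"), levels = lev)

# Hypothetical helper: specificity with `negative` as the negative class,
# i.e. the share of observed negatives that were also predicted negative
spec_from_labels <- function(pred, obs, negative) {
  sum(pred == negative & obs == negative) / sum(obs == negative)
}

# twoClassSummary passes lev[2] ("pos" here) as the negative class
spec_at_half <- spec_from_labels(toy$pred, toy$obs, lev[2])
print(spec_at_half)
```

Because only the labels enter the calculation, changing the cutoff requires rebuilding `pred` from the probability column yourself.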

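To verify my statement, try the following example with a custom summaryFunction in which I explicitly set the threshold to 0.5. You will see that the two values Spec (the original specificity as reported by twoClassSummary) and Spec2 (the specificity with the threshold manually set to 0.5) are exactly the same: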

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  out <- c(spec, spec2)

  names(out) <- c("Spec", "Spec2")
  out
}

# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=customSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)
CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 615, 615, 614, 614, 614 
Resampling results across tuning parameters:

  cp          Spec       Spec2    
  0.01305970  0.5749825  0.5749825
  0.01492537  0.5411600  0.5411600
  0.01741294  0.5596785  0.5596785
  0.10447761  0.4932215  0.4932215
  0.24253731  0.2837177  0.2837177

Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0130597.

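Additionally, if you want caret to compute the maximum specificity over a range of thresholds for each hyperparameter setting and report that value, you can define a custom summaryFunction like the following, which tries all thresholds from 0.1 to 0.95 in steps of 0.05: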

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # specificity from caret's own class predictions
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # specificity with the threshold set manually to 0.5
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  # specificity at each threshold from 0.1 to 0.95 in steps of 0.05
  speclist <- numeric()
  for(i in seq(0.1, 0.95, 0.05)){
    predi <- factor(ifelse(data[, "neg"] > i, "neg", "pos"))
    singlespec <- specificity(predi, data[, "obs"], "pos")
    speclist <- c(speclist, singlespec)
  }
  # keep the largest specificity found across the thresholds
  specmax <- max(speclist)

  out <- c(spec, spec2, specmax)

  names(out) <- c("Spec", "Spec2", "Specmax")
  out
}
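One thing to keep in mind with a sweep like this: raising the cutoff on P(neg) can only move predictions from "neg" to "pos", so specificity (with "pos" as the negative class) never decreases as the cutoff grows, and the maximum tends to sit at the upper end of the grid at the cost of sensitivity. The base-R sketch below runs the same threshold grid on invented toy data and keeps sensitivity next to specificity so the trade-off is visible. The data and the names `toy`, `sweep`, and `best` are hypothetical.

```r
# Toy probabilities (invented); obs uses the same levels as the example
lev <- c("neg", "pos")
toy <- data.frame(
  obs = factor(rep(lev, each = 4), levels = lev),
  neg = c(0.93, 0.81, 0.56, 0.31, 0.72, 0.47, 0.22, 0.07)
)

thresholds <- seq(0.1, 0.95, by = 0.05)

# One row per threshold: Sens treats "neg" as the positive class,
# Spec treats "pos" as the negative class, as twoClassSummary does here
sweep <- do.call(rbind, lapply(thresholds, function(th) {
  pred <- factor(ifelse(toy$neg > th, "neg", "pos"), levels = lev)
  data.frame(
    threshold = th,
    Sens = sum(pred == "neg" & toy$obs == "neg") / sum(toy$obs == "neg"),
    Spec = sum(pred == "pos" & toy$obs == "pos") / sum(toy$obs == "pos")
  )
}))

# First threshold that reaches the largest specificity
best <- sweep[which.max(sweep$Spec), ]
print(sweep)
print(best)
```

If the threshold itself matters for later predictions, returning it alongside Specmax (or inspecting a table like `sweep`) is more informative than the maximum alone.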

Regarding "r - How does caret's train determine the probability threshold to maximise specificity?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45749037/
