r - How does caret's train determine the probability threshold to maximize specificity?

Tags: r, r-caret

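I am using caret's twoClassSummary function to find the model hyperparameters that maximize specificity. But how does the function determine the probability threshold that maximizes specificity?

Basically, for each model hyperparameter/fold, does caret evaluate every threshold between 0 and 1 and return the maximum specificity? In the example below, you can see that the model has settled on cp = 0.01492537.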

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=twoClassSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)


CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results across tuning parameters:

  cp          ROC        Sens   Spec     
  0.01305970  0.7615943  0.824  0.5937806
  0.01492537  0.7712055  0.824  0.6016073
  0.01741294  0.7544469  0.830  0.5976939
  0.10447761  0.6915783  0.866  0.5035639
  0.24253731  0.6437820  0.884  0.4035639

Spec was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.01492537.

Best answer

No, twoClassSummary does not evaluate every threshold between 0 and 1. It only returns the values for the standard threshold of 0.5.
twoClassSummary is defined as:

 function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
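The definition above computes Sens and Spec from the `pred` column only; the class probability columns feed into ROC alone. As a minimal sketch of that point, the base R snippet below builds a toy data frame shaped like the `data` argument (the values and the name `toy` are made up for illustration, and it assumes the hard prediction is simply the more probable class, which is what a 0.5 cutoff means for two classes) and computes both metrics by hand:

```r
# Toy stand-in for the `data` argument passed to a summaryFunction:
# observed class, hard class prediction, one probability column per class.
# All values are invented for illustration.
lev <- c("neg", "pos")
toy <- data.frame(
  obs = factor(c("neg", "neg", "neg", "pos", "pos", "pos"), levels = lev),
  neg = c(0.90, 0.70, 0.40, 0.60, 0.30, 0.10)
)
toy$pos <- 1 - toy$neg

# Assumption: the hard prediction is the more probable class (a 0.5 cutoff)
toy$pred <- factor(ifelse(toy$neg > 0.5, "neg", "pos"), levels = lev)

# Sens treats lev[1] as the event of interest, Spec treats lev[2] as the non-event;
# neither looks at the probability columns
sens <- sum(toy$pred == lev[1] & toy$obs == lev[1]) / sum(toy$obs == lev[1])
spec <- sum(toy$pred == lev[2] & toy$obs == lev[2]) / sum(toy$obs == lev[2])

print(c(Sens = sens, Spec = spec))
```

Nothing in this calculation takes a threshold argument, so there is no place where a search over thresholds could happen.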

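To verify my statement, try the following example with a custom summaryFunction in which I explicitly set the threshold to 0.5. You will see that the two values Spec (the original specificity as reported by twoClassSummary) and Spec2 (the specificity with the threshold manually set to 0.5) are exactly the same: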
# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)

# define custom summaryFunction
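customSummary <- function (data, lev = NULL, model = NULL){
  # Spec: specificity from caret's own class predictions, as in twoClassSummary
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # Spec2: specificity after thresholding the "neg" probability at 0.5 by hand;
  # levels = lev keeps both levels even if only one class is predicted
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"), levels = lev)
  spec2 <- specificity(pred, data[, "obs"], "pos")
  out <- c(spec, spec2)

  names(out) <- c("Spec", "Spec2")
  out
}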

# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=customSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)

CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 615, 615, 614, 614, 614 
Resampling results across tuning parameters:

  cp          Spec       Spec2    
  0.01305970  0.5749825  0.5749825
  0.01492537  0.5411600  0.5411600
  0.01741294  0.5596785  0.5596785
  0.10447761  0.4932215  0.4932215
  0.24253731  0.2837177  0.2837177

Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0130597.

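Furthermore, if you want caret to compute, for each hyperparameter setting, the maximum specificity over any threshold and report that value, you can define a custom summaryFunction like the one below, which tries all thresholds from 0.1 to 0.95 in steps of 0.05:
# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # Spec and Spec2 as before: default predictions and a manual 0.5 threshold
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"), levels = lev)
  spec2 <- specificity(pred, data[, "obs"], "pos")
  # specificity at each threshold from 0.1 to 0.95 in steps of 0.05
  speclist <- numeric(0)
  for(i in seq(0.1, 0.95, 0.05)){
    predi <- factor(ifelse(data[, "neg"] > i, "neg", "pos"), levels = lev)
    singlespec <- specificity(predi, data[, "obs"], "pos")
    speclist <- c(speclist, singlespec)
  }
  # keep the largest specificity found over all thresholds
  specmax <- max(speclist)

  out <- c(spec, spec2, specmax)

  names(out) <- c("Spec", "Spec2", "Specmax")
  out
}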
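One thing worth checking before relying on Specmax: with "pos" as the non-event class, raising the cutoff on the "neg" probability can only move more observations into "pos", so specificity never decreases as the threshold grows. The base R sketch below illustrates this on made-up probabilities (the vectors `obs` and `p_neg` and the helpers `spec_at` and `sens_at` are hypothetical, not part of caret), sweeping the same grid as the function above and tracking sensitivity alongside:

```r
# Invented observed classes and predicted "neg" probabilities, for illustration only
lev   <- c("neg", "pos")
obs   <- factor(c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"), levels = lev)
p_neg <- c(0.97, 0.82, 0.56, 0.31, 0.72, 0.47, 0.22, 0.06)

# Same grid as the custom summaryFunction: 0.1 to 0.95 in steps of 0.05
thresholds <- seq(0.1, 0.95, by = 0.05)

# Hypothetical helper: hard predictions for a given cutoff on the "neg" probability
predict_at <- function(t) factor(ifelse(p_neg > t, lev[1], lev[2]), levels = lev)

# Hypothetical helpers: specificity ("pos" correctly predicted) and
# sensitivity ("neg" correctly predicted) at cutoff t
spec_at <- function(t) sum(predict_at(t) == lev[2] & obs == lev[2]) / sum(obs == lev[2])
sens_at <- function(t) sum(predict_at(t) == lev[1] & obs == lev[1]) / sum(obs == lev[1])

sweep <- data.frame(
  threshold = thresholds,
  Sens      = sapply(thresholds, sens_at),
  Spec      = sapply(thresholds, spec_at)
)
print(sweep)

# Spec is non-decreasing in the threshold, so its maximum sits at the top of the grid
specmax <- max(sweep$Spec)
```

In other words, Specmax as defined above will equal the specificity at the largest threshold tried, bought at the cost of sensitivity, so it may be more informative to inspect the whole sweep than the maximum alone. To have train() select on the new value, the `metric` argument would need to name it, for example `metric = "Specmax"` together with `summaryFunction = customSummary` in trainControl.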

Regarding "r - How does caret's train determine the probability threshold to maximize specificity?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45749037/
