machine-learning - SVM model predicts instances with probability scores greater than 0.1 (default threshold 0.5) as positive

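I am working on a binary classification problem, using logistic regression and support vector machine models imported from sklearn. Both models were fitted on the same imbalanced training data with class weights adjusted, and they achieved comparable performance. When I used these two trained models to predict on a new dataset, the LR model and the SVM model predicted a similar number of positive instances, and the predicted instances overlapped heavily.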

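However, when I looked at the probability scores of the instances classified as positive, the distribution for LR ranges from 0.5 to 1, while the one for SVM starts at around 0.1. I called model.predict(prediction_data) to find the instances predicted as each class, and model.predict_proba(prediction_data) to get the probability scores of being classified as 0 (neg) and 1 (pos), assuming that both models use a default threshold of 0.5.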

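There is no error in my code, and I do not understand why the SVM also predicts instances with probability scores < 0.5 as positive. Any thoughts on how to interpret this situation?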

Best answer

For binary classification with SVC(), this is a known behavior in sklearn; it has been reported, for example, in these GitHub issues (here and here). It is also noted in the User guide:

In addition, the probability estimates may be inconsistent with the scores: the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.

Or, directly in the libsvm FAQ, where it is stated that

Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
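The statement quoted above can be checked empirically. The following is a minimal sketch, assuming synthetic imbalanced data from make_classification (not the asker's data) and an SVC fitted with probability=True; the variable names are illustrative. It compares the labels from predict with the sign of decision_function and with a 0.5 threshold on predict_proba. Whether any disagreements show up, and how many, depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data (an assumption for illustration only).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

svc = SVC(class_weight="balanced", probability=True, random_state=0).fit(X, y)

labels = svc.predict(X)
scores = svc.decision_function(X)
pos_proba = svc.predict_proba(X)[:, 1]

# predict() follows the sign of the decision value ...
from_scores = svc.classes_[(scores > 0).astype(int)]
# ... while thresholding predict_proba at 0.5 is a separate rule.
from_proba = svc.classes_[(pos_proba >= 0.5).astype(int)]

print("predict matches sign of decision_function:",
      np.array_equal(labels, from_scores))
print("samples where predict and the 0.5 threshold disagree:",
      int(np.sum(labels != from_proba)))

predicted_pos = labels == 1
if predicted_pos.any():
    print("lowest positive-class probability among predicted positives:",
          pos_proba[predicted_pos].min())
```

If the last printed value is below 0.5, the run reproduces the situation described in the question: some instances are labeled positive by predict although their predict_proba score is under the assumed threshold.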

To summarize, the key points are:

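On the one hand, predictions are based on the decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
On the other hand, as stated in one of the GitHub issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is the root of the inconsistency. In other words, to always be consistent on binary classification problems, you would need a classifier whose predictions are based on the output of predict_proba() (which, incidentally, is what you get when you use calibrators), like this: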
```
import numpy as np

def predict(self, X):
    y_proba = self.predict_proba(X)
    # Map the argmax column index back to the class label.
    return self.classes_[np.argmax(y_proba, axis=1)]
```
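As a usage sketch of the idea above, the snippet below shows two ways to obtain labels that agree with predict_proba. It assumes synthetic data from make_classification; ProbaConsistentSVC is a hypothetical name introduced here for illustration, not part of sklearn. The second route uses CalibratedClassifierCV, whose predict picks the class with the highest calibrated probability.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC


class ProbaConsistentSVC(SVC):
    """Hypothetical SVC variant whose labels follow predict_proba."""

    def predict(self, X):
        y_proba = self.predict_proba(X)
        # Map the argmax column index back to the class label.
        return self.classes_[np.argmax(y_proba, axis=1)]


# Synthetic imbalanced data (an assumption for illustration only).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Route 1: override predict so it is derived from predict_proba.
clf = ProbaConsistentSVC(
    class_weight="balanced", probability=True, random_state=0
).fit(X, y)
clf_labels = clf.predict(X)

# Route 2: wrap a plain SVC in a calibrator.
calibrated = CalibratedClassifierCV(SVC(class_weight="balanced"), cv=3).fit(X, y)
cal_labels = calibrated.predict(X)
cal_from_proba = calibrated.classes_[
    np.argmax(calibrated.predict_proba(X), axis=1)
]

print("calibrated predict agrees with argmax of predict_proba:",
      np.array_equal(cal_labels, cal_from_proba))
```

Either route trades the original decision rule (the sign of the decision value) for one based on probabilities, so the resulting labels may differ from those of a plain SVC on some samples.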

I would also recommend this post on the topic.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/68475534/
