python - Calculating the confusion matrix for the training set

Tags: python machine-learning cross-validation knn

I am new to machine learning. I recently learned how to compute the confusion_matrix for the test set of a KNN classifier, but I don't know how to compute the confusion_matrix for the training set.

How can I compute the confusion_matrix for the KNN training set based on the following code?

The following code computes the confusion_matrix for the test set:

# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10])  # .ix was removed from pandas; .iloc selects columns by position
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)

# Predicting the Test set results
y_pred = knn.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Calculate the confusion matrix for the test set.

For k-fold cross-validation:

I also tried to find the confusion_matrix for the training set using k-fold cross-validation.

I am confused about this line: knn.fit(X_train, y_train).

Do I need to change this line, knn.fit(X_train, y_train)?

Where should I change the following code to compute the confusion_matrix for the training set?

# Applying k-fold Method
from sklearn.model_selection import StratifiedKFold # sklearn.cross_validation was removed; StratifiedKFold now lives in model_selection
kfold = 10 # no. of folds (better to have this at the start of the code)

skf = StratifiedKFold(n_splits = kfold, shuffle = True, random_state = 0)

# Stratified KFold: this first divides the data into k folds. It also makes sure that the
# class distribution in each fold follows the original input distribution.

skfind = list(skf.split(X, y)) # list of (train indices, test indices) pairs, one per fold

# skfind[i][0] -> train indices, skfind[i][1] -> test indices
# Supervised Classification with k-fold Cross Validation

from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

conf_mat = np.zeros((2,2)) # Initializing the Confusion Matrix

n_neighbors = 5 # better to have this at the start of the code

# 10-fold Cross Validation

for i in range(kfold):
    train_indices = skfind[i][0]
    test_indices = skfind[i][1]

    clf = KNeighborsClassifier(n_neighbors = n_neighbors, metric = 'minkowski', p = 2)
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]

    # fit on the Training set
    clf.fit(X_train, y_train)

    # predict the Test data
    y_predict_test = clf.predict(X_test) # output is labels and not indices

    # Compute the confusion matrix for this fold
    cm = confusion_matrix(y_test, y_predict_test)
    print(cm)
    # conf_mat = conf_mat + cm

Best Answer

You don't have to change much:

# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)

Here we predict on X_train instead of X_test, and then build the confusion matrix from the predicted classes and the actual classes of the training set.
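For a quick overfitting check, it can help to print both matrices side by side. A minimal sketch, reusing knn, X_train, X_test, y_train and y_test from the first code block above:

from sklearn.metrics import confusion_matrix

# Confusion matrix on the held-out test set
cm_test = confusion_matrix(y_test, knn.predict(X_test))

# Confusion matrix on the training set the model was fitted on
cm_train = confusion_matrix(y_train, knn.predict(X_train))

print("Test set confusion matrix:\n", cm_test)
print("Training set confusion matrix:\n", cm_train)

If the training matrix has far fewer off-diagonal (misclassified) counts than the test matrix, the model is likely overfitting.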

The idea behind a confusion matrix is essentially to count how many predictions fall into each of four categories (when y is binary):

  1. Predicted true but actually false (false positive)
  2. Predicted true and actually true (true positive)
  3. Predicted false but actually true (false negative)
  4. Predicted false and actually false (true negative)

So as long as you have the two sets, predicted and actual, you can create a confusion matrix. All you have to do is predict the classes and compare them with the actual classes to get the confusion matrix.
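To see how scikit-learn lays out these four counts, here is a minimal sketch with made-up binary labels (the y_actual and y_predicted arrays are purely illustrative):

from sklearn.metrics import confusion_matrix

y_actual    = [0, 0, 1, 1, 1, 0]   # hypothetical true classes
y_predicted = [0, 1, 1, 1, 0, 0]   # hypothetical predicted classes

# For binary labels [0, 1], scikit-learn orders the matrix as
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))
# [[2 1]
#  [1 2]]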

EDIT

For the cross-validation part, you can add a line y_predict_train = clf.predict(X_train) inside the loop to compute a confusion matrix for each iteration. You can do this because you initialize clf on every pass through the loop, which basically means resetting your model.

Also, in your code you compute the confusion matrix in every iteration but never store it anywhere, so at the end you are left with only the cm of the last test fold.
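Putting both points together, a minimal sketch of the modified loop. It assumes X, y, kfold and skfind from the cross-validation code in the question, binary labels (so each fold's matrix is 2x2), and conf_mat_train is a name introduced here only for illustration:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

conf_mat = np.zeros((2, 2))        # accumulated test-set confusion matrix
conf_mat_train = np.zeros((2, 2))  # accumulated training-set confusion matrix (hypothetical name)

for i in range(kfold):
    train_indices, test_indices = skfind[i]

    clf = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    clf.fit(X[train_indices], y[train_indices])

    # Test-set confusion matrix for this fold, added to the running total
    conf_mat += confusion_matrix(y[test_indices], clf.predict(X[test_indices]))

    # Training-set confusion matrix for this fold (the extra predict mentioned above)
    y_predict_train = clf.predict(X[train_indices])
    conf_mat_train += confusion_matrix(y[train_indices], y_predict_train)

print(conf_mat)        # summed over the 10 test folds
print(conf_mat_train)  # summed over the 10 training folds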

Regarding "python - Calculating the confusion matrix for the training set", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45854885/
