python - Fixing 100% accuracy with DecisionTreeClassifier in scikit-learn

Tags: python, scikit-learn

I am trying to do classification with a decision tree and am getting 100% accuracy.

This is a common problem, described here and here, as well as in many other questions.

The data is here.

My two best guesses:

  • I am splitting the data incorrectly
  • My dataset is too imbalanced (a quick check of both guesses is sketched right after this list)
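
A minimal sketch of how both guesses could be checked, assuming the starbucks DataFrame and the offer_completed target column used in the code below:

import pandas as pd
from sklearn.model_selection import train_test_split

# How imbalanced is the target? Prints the fraction of each class.
print(starbucks['offer_completed'].value_counts(normalize=True))

# A stratified split keeps the class ratio identical in train and test,
# which rules out an unlucky split as the cause.
X = starbucks.loc[:, starbucks.columns != 'offer_completed']
Y = starbucks['offer_completed']
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100, stratify=Y)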

Is there something wrong with my code?

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import sklearn.model_selection as cv
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 

# Split data
Y = starbucks.iloc[:, 4]
X = starbucks.loc[:, starbucks.columns != 'offer_completed']

# Splitting the dataset into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                                                    test_size=0.3,
                                                    random_state=100) 

# Creating the classifier object 
clf_gini = DecisionTreeClassifier(criterion = "gini", 
                                  random_state = 100, 
                                  max_depth = 3, 
                                  min_samples_leaf = 5) 

# Performing training 
clf_gini.fit(X_train, y_train)

# Prediction on test with gini index 
y_pred = clf_gini.predict(X_test) 
print("Predicted values:") 
print(y_pred) 

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred)) 

print ("Accuracy : ", accuracy_score(y_test, y_pred)*100) 

print("Report : ", classification_report(y_test, y_pred)) 

# prediction() and cal_accuracy() are helper functions not shown in this snippet
y_pred_gini = prediction(X_test, clf_gini) 
cal_accuracy(y_test, y_pred_gini) 


Predicted values:
[0. 0. 0. ... 0. 0. 0.]
Confusion Matrix:  [[36095     0]
                    [    0  8158]]
Accuracy :  100.0

When I print X, it shows that offer_completed has been dropped.

X.dtypes

offer_received               int64
offer_viewed               float64
time_viewed_received       float64
time_completed_received    float64
time_completed_viewed      float64
transaction                float64
amount                     float64
total_reward               float64
age                        float64
income                     float64
male                         int64
membership_days            float64
reward_each_time           float64
difficulty                 float64
duration                   float64
email                      float64
mobile                     float64
social                     float64
web                        float64
bogo                       float64
discount                   float64
informational              float64

Best Answer

If you fit the model and check the feature importances, you can see that all of them are zero except for total_reward. Then, investigating that column, you get:

df.groupby(target)['total_reward'].describe()
    count   mean    std    min   25%    50%   75%    max
0   119995  0.0     0.0    0.0   0.0    0.0   0.0    0.0
1   27513   5.74    4.07   2.0   3.0    5.0   10.0   40.0

You can see that for target 0, total_reward is always zero, while otherwise it is always greater than 0. That is your leakage.
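
A minimal sketch of the feature-importance check described above, assuming clf_gini and X from the question:

import pandas as pd

# Map each column name to its importance in the fitted tree; with the leak,
# virtually all of the weight sits on total_reward.
importances = pd.Series(clf_gini.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))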

Since there may be other leaks, and checking every column by hand is tedious, we can instead measure a kind of "predictive power" of each feature used on its own:

acc_df = pd.DataFrame(columns=['col', 'acc'], index=range(len(X.columns)))

for i, c in enumerate(X.columns):

    clf = DecisionTreeClassifier(criterion = "gini", 
                                 random_state = 100, 
                                 max_depth = 3, 
                                 min_samples_leaf = 5) 
    
    clf.fit(X_train[c].to_numpy()[:, None], y_train)
    
    y_pred = clf.predict(X_test[c].to_numpy()[:, None])
    acc_df.iloc[i] = [c, accuracy_score(y_test, y_pred)*100]


acc_df.sort_values('acc',ascending=False)
                 col      acc
8       total_reward      100
4     completed_time  99.8848
13  reward_each_time  89.3205
14        difficulty  89.3205
15          duration  89.3205
21          discount  86.4054
19               web   85.088
20              bogo  84.4801
3        viewed_time  84.4056
2       offer_viewed  84.3491
18            social  83.3525
1      received_time  83.0497
7             amount  82.5436
0     offer_received  81.7526
16             email  81.7526
17            mobile  81.6464
11              male  81.5651
10            income  81.5651
9                age  81.5651
6   transaction_time  81.5651
5        transaction  81.5651
22     informational  81.5651
12   membership_days  81.5561
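
Once the leaking columns are identified, a possible next step is to drop them and refit; this is only a sketch, and exactly which columns count as leakage is a judgement call based on the table above:

# Hypothetical cleanup: drop columns that can only be known after an offer
# is completed, then refit and re-evaluate on a fresh split.
leaky_cols = ['total_reward', 'time_completed_received', 'time_completed_viewed']
X_clean = X.drop(columns=leaky_cols)

X_train, X_test, y_train, y_test = train_test_split(X_clean, Y,
                                                    test_size=0.3,
                                                    random_state=100)

clf = DecisionTreeClassifier(criterion="gini", random_state=100,
                             max_depth=3, min_samples_leaf=5)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)) * 100)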

Regarding fixing 100% accuracy with DecisionTreeClassifier in scikit-learn, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63279188/
