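I am trying to use a decision tree for classification and I am getting 100% accuracy.
This is a common problem, described here and here, and in many other questions.
The data is here.
My two best guesses:
- I split the data incorrectly
- My dataset is too imbalanced
Is there anything wrong with my code?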
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import sklearn.model_selection as cv
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Split data
Y = starbucks.iloc[:, 4]
X = starbucks.loc[:, starbucks.columns != 'offer_completed']
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.3,
                                                    random_state=100)
# Create the classifier object
clf_gini = DecisionTreeClassifier(criterion="gini",
                                  random_state=100,
                                  max_depth=3,
                                  min_samples_leaf=5)
# Train the classifier
clf_gini.fit(X_train, y_train)
# Predict on the test set with the Gini-based tree
y_pred = clf_gini.predict(X_test)
print("Predicted values:")
print(y_pred)
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print("Accuracy : ", accuracy_score(y_test, y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))
# prediction() and cal_accuracy() are helper functions that are not shown here
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)
Predicted values:
[0. 0. 0. ... 0. 0. 0.]
Confusion Matrix: [[36095 0]
[ 0 8158]]
Accuracy : 100.0
When I print X, it shows that offer_completed has been removed.
X.dtypes
offer_received int64
offer_viewed float64
time_viewed_received float64
time_completed_received float64
time_completed_viewed float64
transaction float64
amount float64
total_reward float64
age float64
income float64
male int64
membership_days float64
reward_each_time float64
difficulty float64
duration float64
email float64
mobile float64
social float64
web float64
bogo float64
discount float64
informational float64
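Best answer
If you fit the model and check the feature importances, you can see that they are all zero except for total_reward. Investigating that column then gives: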
df.groupby(target)['total_reward'].describe()
    count  mean   std  min  25%  50%   75%   max
0  119995  0.00  0.00  0.0  0.0  0.0   0.0   0.0
1   27513  5.74  4.07  2.0  3.0  5.0  10.0  40.0
You can see that for target 0, total_reward is always zero, while for target 1 it is always greater than 0. This is your leak.
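To illustrate the check described above, here is a minimal sketch on synthetic data. The DataFrame `demo`, its values, and the helper `fit_and_score` are made up for illustration and are not the Starbucks data; only the column name `total_reward` and the tree settings are borrowed from the question. It fits the same kind of tree twice, once with the leaky column and once without, and prints the feature importances.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (all values are invented).
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.2).astype(int)
demo = pd.DataFrame({
    "age": rng.normal(45, 15, n),
    "income": rng.normal(60000, 20000, n),
    # Leaky column: nonzero only when the target is 1, like total_reward.
    "total_reward": np.where(y == 1, rng.integers(2, 11, n), 0).astype(float),
})

def fit_and_score(features):
    """Hypothetical helper: fit a shallow tree, return test accuracy and importances."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, random_state=100)
    clf = DecisionTreeClassifier(criterion="gini", random_state=100,
                                 max_depth=3, min_samples_leaf=5)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    importances = pd.Series(clf.feature_importances_, index=features.columns)
    return acc, importances

acc_leaky, imp_leaky = fit_and_score(demo)
acc_clean, imp_clean = fit_and_score(demo.drop(columns=["total_reward"]))

print(imp_leaky.sort_values(ascending=False))
print("accuracy with leaky column:   ", acc_leaky)
print("accuracy without leaky column:", acc_clean)
```

In this toy setup the leaky column takes all of the importance and the accuracy is perfect; once it is dropped, the accuracy falls to a more believable level. Whether dropping total_reward alone is enough on the real data depends on whether other columns leak as well.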
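Since there may be other leaks and checking every column by hand is tedious, we can measure a kind of "predictive power" for each feature on its own: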
acc_df = pd.DataFrame(columns=['col', 'acc'], index=range(len(X.columns)))
# Fit one small tree per column and record its test accuracy
for i, c in enumerate(X.columns):
    clf = DecisionTreeClassifier(criterion="gini",
                                 random_state=100,
                                 max_depth=3,
                                 min_samples_leaf=5)
    # [:, None] reshapes the single column to the 2-D shape sklearn expects
    clf.fit(X_train[c].to_numpy()[:, None], y_train)
    y_pred = clf.predict(X_test[c].to_numpy()[:, None])
    acc_df.iloc[i] = [c, accuracy_score(y_test, y_pred)*100]
acc_df.sort_values('acc', ascending=False)
col acc
8 total_reward 100
4 completed_time 99.8848
13 reward_each_time 89.3205
14 difficulty 89.3205
15 duration 89.3205
21 discount 86.4054
19 web 85.088
20 bogo 84.4801
3 viewed_time 84.4056
2 offer_viewed 84.3491
18 social 83.3525
1 received_time 83.0497
7 amount 82.5436
0 offer_received 81.7526
16 email 81.7526
17 mobile 81.6464
11 male 81.5651
10 income 81.5651
9 age 81.5651
6 transaction_time 81.5651
5 transaction 81.5651
22 informational 81.5651
12 membership_days 81.5561
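When reading a table like this, it helps to compare each score with the accuracy of always predicting the majority class. The sketch below rebuilds the test labels from the class counts in the question's confusion matrix (36095 negatives, 8158 positives) and scores scikit-learn's `DummyClassifier`. The names `y_test_demo` and `X_dummy` are placeholders introduced here, and fitting on the test labels is only a shortcut for the arithmetic; with real data the baseline would be fit on the training split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts taken from the confusion matrix in the question (test set).
n_negative, n_positive = 36095, 8158
y_test_demo = np.array([0] * n_negative + [1] * n_positive)

# DummyClassifier ignores the features, so a column of zeros is enough.
X_dummy = np.zeros((len(y_test_demo), 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_test_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_dummy)) * 100
print(round(baseline_acc, 4))
```

The result is about 81.57, which appears to match the 81.5651 reported for columns such as age, income and male. Under that reading, those columns add nothing over the majority-class baseline when used alone, while columns that score far above it (total_reward, and completed_time at 99.88) are the ones worth inspecting for leakage.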
Source: the Stack Overflow question "Fixing 100% accuracy with DecisionTreeClassifier in scikit-learn" (python): https://stackoverflow.com/questions/63279188/