machine-learning - How do I use a fixed validation set (not K-fold cross-validation) in Scikit-learn for a decision tree classifier / random forest classifier?

I am new to machine learning and data science. Apologies if this is a very silly question.

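I see that there is a built-in function for cross-validation, but none for a fixed validation set. I have a dataset of 50,000 samples labelled with years from 1990 to 2010. I need to train different classifiers on the 1990-2008 samples, validate them on the 2009 samples, and test them on the 2010 samples.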

Edit: After @Quan Tran's answer, I tried this. Is this how it should be done?

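import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree
estimator1 = DecisionTreeClassifier(max_depth=9, max_leaf_nodes=9)
estimator1.fit(X_train, y_train)
print(estimator1)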


# validate using validation set
acc = np.zeros((20, 20))  # store accuracy
for i in range(20):
    for j in range(20):
        estimator1 = DecisionTreeClassifier(max_depth=i + 1, max_leaf_nodes=j + 2)
        estimator1.fit(X_valid, y_valid)
        y_pred = estimator1.predict(X_valid)
        acc[i, j] = accuracy_score(y_valid, y_pred)

best_mod = np.where(acc == acc.max())
print(best_mod)
print(acc[best_mod])



# Predict target values
estimator1 = DecisionTreeClassifier(max_depth=int(best_mod[0]) + 1, max_leaf_nodes=int(best_mod[1]) + 2)
estimator1.fit(X_valid, y_valid)
y_pred = estimator1.predict(X_test)
confusion = metrics.confusion_matrix(y_test, y_pred)

TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]


# Classification Accuracy
print("======= ACCURACY ========")
print((TP + TN) / float(TP + TN + FP + FN))
print(accuracy_score(y_valid, y_pred))
# store the predicted probabilities for class 1
y_pred_prob = estimator1.predict_proba(X_test)[:, 1]


# plot a ROC curve for y_test and y_pred_prob
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for DecisionTreeClassifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

print("======= AUC ========")
print(metrics.roc_auc_score(y_test, y_pred_prob))

The result I get is not the best accuracy:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
        max_features=None, max_leaf_nodes=9, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        presort=False, random_state=None, splitter='best')
(array([5]), array([19]))
[ 0.8489011]
======= ACCURACY ========
0.574175824176
0.538461538462
======= AUC ========
0.547632099893

Best answer

In this case there are three separate sets: the training set, the validation set, and the test set.
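For the data described in the question, the three sets can be carved out by year. The following is a minimal sketch, assuming the features, labels, and years are held in NumPy arrays. The names X, y, and years, and the randomly generated data, are hypothetical stand-ins rather than the asker's actual dataset.

```python
import numpy as np

# Hypothetical stand-in for the asker's data: 50,000 samples, each tagged with a year.
rng = np.random.default_rng(0)
n_samples = 50_000
years = rng.integers(1990, 2011, size=n_samples)  # years 1990..2010 inclusive
X = rng.normal(size=(n_samples, 5))               # made-up feature matrix
y = rng.integers(0, 2, size=n_samples)            # made-up binary labels

# Boolean masks select each split by year.
train_mask = years <= 2008
valid_mask = years == 2009
test_mask = years == 2010

X_train, y_train = X[train_mask], y[train_mask]
X_valid, y_valid = X[valid_mask], y[valid_mask]
X_test, y_test = X[test_mask], y[test_mask]

print(X_train.shape, X_valid.shape, X_test.shape)
```

Because the masks are mutually exclusive, no sample can end up in more than one set.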

The training set is used to fit the classifier's parameters. For example:

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(trainfeatures, labels)

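The validation set is used to tune the classifier's hyperparameters or to find a cutoff point for the training procedure. For a decision tree, for example, max_depth is a hyperparameter. You find a good set of hyperparameters by trying different values (tuning) and comparing a performance metric (accuracy, precision, etc.) on the validation set.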
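Note that the loop in the question calls fit on X_valid and then scores on the same X_valid, so the number it records is training accuracy on the validation data rather than a validation score. A minimal sketch of the tuning step as described here fits every candidate on the training set and scores it on the validation set. The synthetic data from make_classification and the particular grid values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: first 800 rows for training, last 200 for validation.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, y_train = X[:800], y[:800]
X_valid, y_valid = X[800:], y[800:]

best_score, best_params = -1.0, None
for max_depth in range(1, 11):
    for max_leaf_nodes in range(2, 22, 4):
        clf = DecisionTreeClassifier(
            max_depth=max_depth, max_leaf_nodes=max_leaf_nodes, random_state=0
        )
        clf.fit(X_train, y_train)  # fit on the training set only
        score = accuracy_score(y_valid, clf.predict(X_valid))  # score on validation
        if score > best_score:
            best_score = score
            best_params = {"max_depth": max_depth, "max_leaf_nodes": max_leaf_nodes}

print(best_params, best_score)
```

The same loop should work for RandomForestClassifier by swapping the estimator and the hyperparameters being searched.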

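The test set is used to estimate the error rate on unseen data. Once performance has been measured on the test set, the model must not be trained or tuned any further.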
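Scikit-learn can also run this fixed-split workflow through its search tools: PredefinedSplit describes a single train/validation split, and GridSearchCV accepts it as its cv argument. The sketch below is one possible arrangement, not the only one. It assumes synthetic data and an arbitrary small grid, and it touches the test set exactly once at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data split into train / validation / test by position.
X, y = make_classification(n_samples=1200, n_features=10, random_state=0)
X_train, y_train = X[:800], y[:800]
X_valid, y_valid = X[800:1000], y[800:1000]
X_test, y_test = X[1000:], y[1000:]

# -1 keeps a row in the training part; 0 assigns it to the single validation fold.
test_fold = np.concatenate(
    [np.full(len(X_train), -1), np.zeros(len(X_valid), dtype=int)]
)
split = PredefinedSplit(test_fold)

# GridSearchCV receives train and validation rows together; the split separates them.
X_trval = np.vstack([X_train, X_valid])
y_trval = np.concatenate([y_train, y_valid])

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8], "max_leaf_nodes": [4, 8, 16]},
    cv=split,
    scoring="accuracy",
    refit=True,  # refit the best setting on train + validation rows combined
)
search.fit(X_trval, y_trval)

# Evaluate once on the held-out test set; no further tuning after this point.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_accuracy)
```

With refit=True the final model is trained on the training and validation rows combined. If the final model should see only the training rows, set refit=False and fit a fresh estimator with search.best_params_ on X_train.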

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40179513/
