machine-learning - All classifiers give exactly the same accuracy after PCA decomposition

Tags: machine-learning pca

I am running some machine learning code, part of which is shown below:

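from sklearn import decomposition
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# names, features, loan_status and train_predict are defined elsewhere (not shown)
classifiers = [
    XGBClassifier(),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]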

print("Original data")
print("=============")
print(features.shape)
for name, clf in zip(names, classifiers):
    print(name)
    X_train, X_test, y_train, y_test = train_test_split(features, loan_status, test_size = 0.2, random_state = 0)
    result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
    print(result)
    print('-----------------------------------')

print("PCA data")
print("=============")
for pca_comp in range(1,6):
    print("PCA component size: " + str(pca_comp))
    pca = decomposition.PCA(n_components=pca_comp)
    pca.fit(features)
    features_pca = pca.transform(features)
    for name, clf in zip(names, classifiers):
        X_train, X_test, y_train, y_test = train_test_split(features_pca, loan_status, test_size = 0.2, random_state = 0)
        result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
        print(result)
        print('-----------------------------------')

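In effect, I iterate over several classifiers and print their results. Then I run PCA with several different n_components values and, for each one, run all the classifiers again.

I found that once PCA is applied, the accuracy (acc_test and acc_train) stays the same no matter which classifier I use or which n_components value I choose.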

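Here is the output of this part of the code. Note that once PCA starts, "acc_test" is almost always 0.8079021551332182.

Unfortunately, I cannot share the data. However, I am looking for anything in the code that is obviously wrong.

Thanks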

Original data
=============
(769790, 207)
XGBoost
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 273.7087504863739, 'pred_time': 4.388766288757324, 'acc_train': 0.848625923953286, 'acc_test': 0.8481793735953962, 'f_train': 0.877928251001055, 'f_test': 0.8775348027423189}
-----------------------------------
Decision Tree
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 11.388459920883179, 'pred_time': 0.38187479972839355, 'acc_train': 0.8347195338988556, 'acc_test': 0.8338183140856598, 'f_train': 0.8735138626721308, 'f_test': 0.8728762797972536}
-----------------------------------
Random Forest
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 1.3620502948760986, 'pred_time': 0.8454875946044922, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
Neural Net
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 130.09251832962036, 'pred_time': 8.788004636764526, 'acc_train': 0.810022863378324, 'acc_test': 0.8106106860312553, 'f_train': 0.8429408284567822, 'f_test': 0.84336348394109}
-----------------------------------
AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 114.49720454216003, 'pred_time': 6.846264839172363, 'acc_train': 0.8319898933475364, 'acc_test': 0.830836981514439, 'f_train': 0.8676524880554248, 'f_test': 0.866917350579005}
-----------------------------------
Naive Bayes
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 2.338545322418213, 'pred_time': 2.913602828979492, 'acc_train': 0.696707868379688, 'acc_test': 0.6979565855622962, 'f_train': 0.8374139063372146, 'f_test': 0.8381986507744102}
-----------------------------------
QDA
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 17.64940857887268, 'pred_time': 6.382497072219849, 'acc_train': 0.5545554631782694, 'acc_test': 0.5551124332610192, 'f_train': 0.7616845459479327, 'f_test': 0.7619965387905216}
-----------------------------------
PCA data
=============
PCA component size: 1
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 12.907331943511963, 'pred_time': 2.0308330059051514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.6030781269073486, 'pred_time': 0.03420734405517578, 'acc_train': 0.8074718429701607, 'acc_test': 0.8079021551332182, 'f_train': 0.8398076830188118, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.2026519775390625, 'pred_time': 0.5144689083099365, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.960830450057983, 'pred_time': 0.7337024211883545, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 9.310431957244873, 'pred_time': 2.949209451675415, 'acc_train': 0.807460476233778, 'acc_test': 0.8078956598552852, 'f_train': 0.8398003208188749, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.028026819229125977, 'pred_time': 0.019958019256591797, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.039576053619384766, 'pred_time': 0.021703481674194336, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 2
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 17.529640436172485, 'pred_time': 2.1811327934265137, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.9235944747924805, 'pred_time': 0.03514695167541504, 'acc_train': 0.8074588524142948, 'acc_test': 0.8079021551332182, 'f_train': 0.8397974448899658, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.8425581455230713, 'pred_time': 0.519752025604248, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 17.796229362487793, 'pred_time': 1.4105899333953857, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 14.433330059051514, 'pred_time': 2.9874980449676514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09282994270324707, 'pred_time': 0.06884241104125977, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.06534266471862793, 'pred_time': 0.06316208839416504, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 3
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 22.586288690567017, 'pred_time': 2.132150650024414, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.3756062984466553, 'pred_time': 0.0391697883605957, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.6991543769836426, 'pred_time': 0.5463252067565918, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.745409488677979, 'pred_time': 1.617872714996338, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 18.745909929275513, 'pred_time': 3.02945613861084, 'acc_train': 0.8074539809558451, 'acc_test': 0.8078956598552852, 'f_train': 0.8397946213935711, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09948086738586426, 'pred_time': 0.07936644554138184, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.07803058624267578, 'pred_time': 0.07502388954162598, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 4
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 28.096595287322998, 'pred_time': 2.079728364944458, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.9280765056610107, 'pred_time': 0.04021263122558594, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.067602872848511, 'pred_time': 0.5436885356903076, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 18.260048389434814, 'pred_time': 2.397339344024658, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 24.486289501190186, 'pred_time': 3.059351921081543, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.10924768447875977, 'pred_time': 0.08964681625366211, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.09738326072692871, 'pred_time': 0.08622312545776367, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------

Best answer

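I do not see anything obviously wrong in your code.

A few thoughts:

As you reduce n_components toward 1, I would expect the classifiers to become more and more similar, but not identical as you observe.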

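You only loop over range(1,6), i.e. 1 to 5 PCA components. Verify that the classifiers are training properly by looping over a wider set of component counts such as (1, 10, 20, 30, 100). If the classifiers still all perform identically, then something is wrong.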
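The wider loop could look roughly like the sketch below. It runs on synthetic data from make_classification as a stand-in for the unshared dataset (an assumption, as are the sample count and class weights), uses a single classifier for brevity, and fits PCA on the training split only, which differs from the code in the question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the unshared data (assumption): 207 features, imbalanced classes.
X, y = make_classification(n_samples=2000, n_features=207, n_informative=20,
                           weights=[0.2, 0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for n_comp in (1, 10, 20, 30, 100):
    # Fit PCA on the training split only, then project both splits.
    pca = PCA(n_components=n_comp).fit(X_train)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(pca.transform(X_train), y_train)
    scores[n_comp] = clf.score(pca.transform(X_test), y_test)
    print(n_comp, scores[n_comp])
```

If the test accuracy does not move at all across such a wide range on the real data, that points to a problem upstream of the classifiers.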

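Also inspect and manually verify that nothing strange happens to the features during the PCA transform. Just run the same code and look at histograms of the new features; something odd may be going on.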
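A quick text-only way to do that inspection is sketched below. The random matrix X is a placeholder for `features` (an assumption), and np.histogram is used instead of a plotting library so the snippet has no extra dependencies.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # placeholder for `features` (assumption)

features_pca = PCA(n_components=3).fit_transform(X)

# Summarize each transformed column: range, spread, and a coarse histogram.
for i in range(features_pca.shape[1]):
    column = features_pca[:, i]
    counts, edges = np.histogram(column, bins=10)
    print(f"component {i}: min={column.min():.3f} "
          f"max={column.max():.3f} std={column.std():.3f}")
    print("  bin counts:", counts)
```

Columns that are nearly constant, or dominated by a handful of extreme values, would be worth a closer look.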

Check the explained variance and make sure the additional components are adding information: print(pca.explained_variance_ratio_)
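As a small illustration of what that printout can reveal, the sketch below uses made-up data in which one column sits on a much larger scale than the others. The scale difference is purely an assumption for the example; nothing in the question says the real features look like this.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 5 columns, the first on a far larger scale than the rest.
X = rng.normal(size=(1000, 5)) * np.array([1000.0, 1.0, 1.0, 1.0, 1.0])

pca = PCA(n_components=5).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios)             # share of variance captured by each component
print(np.cumsum(ratios))  # cumulative share
```

When the first ratio is close to 1, the remaining components carry almost no variance, so adding them gives a classifier very little new information. In that situation, standardizing the features before PCA (for example with sklearn.preprocessing.StandardScaler) is one thing to try.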

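Given that the classifiers are already quite similar with all 207 features, once you run PCA they may simply all be seeing the same thing.

With default parameters (i.e. very simple classifiers), it is possible (but unlikely) that the classifiers perform the same across 1 to 5 components.

Also make sure your loops are correct (they look fine) and do some sanity checks. Good luck!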
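One concrete sanity check along these lines is to compare against a classifier that always predicts the most frequent class. The sketch below uses placeholder data (random features and roughly 80% positive labels, both assumptions) and has not been checked against the real data. If the repeated 0.8079021551332182 matches such a baseline, the models would be predicting a single class for every sample, which would explain the identical scores. Note that the Random Forest on the original data also reports exactly this accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))            # placeholder features (assumption)
y = (rng.random(1000) < 0.8).astype(int)  # placeholder labels, ~80% positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Share of the most common label in the test split.
majority_share = np.bincount(y_test).max() / len(y_test)
print(baseline_acc, majority_share)
```

On the real data, printing np.unique(clf.predict(X_test)) for each fitted classifier would show directly whether only one class is ever predicted.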

Regarding "machine-learning - All classifiers give exactly the same accuracy after PCA decomposition", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53507952/
