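I have a dataset and want to apply scaling to a subset of the columns of a pandas DataFrame, then apply PCA, and get back only the components plus the untransformed columns. Using the mpg dataset from seaborn, the training set for predicting mpg looks like this:
Now suppose I want to leave cylinders and displacement untouched, scale everything else, and reduce it to 2 components. I expect the result to have 4 columns in total: the original 2 plus the 2 components.
How can I use ColumnTransformer to scale a subset of columns, then apply PCA, and return only the components and the 2 passthrough columns?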
MWE
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
pd.DataFrame(trans)
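I strongly suspect that my understanding of how this step works is wrong: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough').
I thought it would run on the last 4 columns, scaling first and then applying PCA, and finally return 2 components. Instead I get 8 columns: the first 4 are the scaled columns, the next 2 appear to be the components (probably computed without scaling first), and the last 2 are the passthrough columns.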
Best answer
I think this works, but I don't know whether it is the idiomatic Python/scikit-learn way to solve it:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
dtm_i2 = list(range(0, len(X_train.columns)-2))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)
pd.DataFrame(trans)
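A possible alternative to chaining two ColumnTransformers, offered as a sketch and not as the definitive approach: ColumnTransformer applies each listed transformer to the original columns independently and concatenates the outputs, which would explain the 8 columns in the question. Wrapping the scaler and PCA in a single Pipeline and passing that Pipeline as one transformer makes them run in sequence. To keep the example self-contained, it uses a synthetic DataFrame (X_demo, random values) that only borrows the mpg column names; the output names pc1 and pc2 are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for X_train, random values with the mpg feature names.
rng = np.random.default_rng(21)
cols = ["cylinders", "displacement", "horsepower",
        "weight", "acceleration", "model_year"]
X_demo = pd.DataFrame(rng.normal(size=(50, 6)), columns=cols)

# Columns to scale and reduce; the first two are left untouched.
reduce_cols = cols[2:]

# Listing the scaler and PCA as separate transformers runs them side by side
# on the same input columns: 4 scaled + 2 components + 2 passthrough.
parallel = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), reduce_cols),
                  ("pca", PCA(n_components=2), reduce_cols)],
    remainder="passthrough",
)
parallel_out = parallel.fit_transform(X_demo)

# A Pipeline runs its steps in order, so PCA sees the scaled columns.
scale_then_pca = Pipeline(steps=[("scaler", StandardScaler()),
                                 ("pca", PCA(n_components=2))])

preprocess = ColumnTransformer(
    transformers=[("scale_pca", scale_then_pca, reduce_cols)],
    remainder="passthrough",
)
trans = preprocess.fit_transform(X_demo)

# Transformer output comes first, passthrough columns are appended after it.
result = pd.DataFrame(trans,
                      columns=["pc1", "pc2", "cylinders", "displacement"],
                      index=X_demo.index)
```

With the real data, X_train would take the place of X_demo, and preprocess.transform(X_test) would reuse the scaling and components fitted on the training set.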
Regarding "python - Applying scaling and PCA to a subset of columns in ColumnTransformer", the original question is on Stack Overflow: https://stackoverflow.com/questions/65136230/