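python - Preprocessing, resampling and pipelines - and an error in between

I have a dataset with different types of variables: binary, categorical, numerical and textual.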

   Text                                                Age   Type          Link             Start    Passed  Default
0  care packag saint luke cathol church wa ...         21.0  organisation  saintlukemclean  <2001.0  0       0
1  opportun busi group center food support compan...   23.0  organisation  cfanj            <2003.0  0       0
2  holiday ice rink persh squar depart cultur sit...   98.0  home          culturela        >1975.0  0       0

I used different transformers: one for the categorical variables (OneHotEncoder), one for the numerical ones (SimpleImputer), and one for the text variable (CountVectorizer/TF-IDF):

categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
# categorical_encoder =  ('CV',CountVectorizer())

numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# CountVectorizer
text_preprocessing_cv =  Pipeline(steps=[
    ('CV',CountVectorizer())
]) 

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF',TfidfVectorizer())       
])

I transform my features and pass them into a pipeline (with the classifiers Logistic Regression, Multinomial Naive Bayes, Random Forest and SVM) as follows:

preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing, categorical_columns),
        ('numeric', numeric_preprocessing, numerical_columns)
])

However, I get an error at this step:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train) # <-- error

ValueError: Selected columns, ['Age','Default'] are not unique in dataframe.

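This error might be caused by my oversampling or by the way I preprocess the features. As I understand it, resampling should be applied only to the training set to avoid overfitting, but it is not clear to me whether the different variable types and transformers need to be handled before or after the resampling. Please let me know.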
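A quick way to narrow this down, assuming the message means that one of the selected labels occurs more than once among the DataFrame's columns, is to list the duplicated labels. The frame below is a made-up stand-in (`frame`, `dupes` and `deduped` are illustrative names), not the real data:

```python
import pandas as pd

# Hypothetical frame in which the label 'Age' appears twice
frame = pd.DataFrame(
    [[21.0, 0, 21.0], [23.0, 1, 23.0]],
    columns=['Age', 'Default', 'Age'],
)

# Labels that occur more than once (each repeat after the first is flagged)
dupes = frame.columns[frame.columns.duplicated()].tolist()
print(dupes)

# One possible clean-up: keep only the first occurrence of each label
deduped = frame.loc[:, ~frame.columns.duplicated()]
print(list(deduped.columns))
```

If the same check on `X_train` comes back empty, duplicated labels are probably not the cause and the column specification in the ColumnTransformer is the next thing to look at.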

I would appreciate it if you could help me fix the error so that the pipeline works with this preprocessing. Thanks.

Code for reference:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample

text_columns = ['Text']
categorical_columns = ['Type', 'Link', 'Start']
numerical_columns = ['Age', 'Default']  # can I consider the boolean as numerical?

X = df[categorical_columns + numerical_columns + text_columns]
y = df['Passed']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)  # needed for the re-sampling technique

passed = training_set[training_set['Passed'] == 1]
not_passed = training_set[training_set['Passed'] == 0]

# Oversampling the minority
oversample = resample(passed,
                      replace=True,
                      n_samples=len(not_passed))

# Returning to new training set
oversample_train = pd.concat([not_passed, oversample])

train_df = oversample_train.copy()  # this train set is after applying the re-sampling
test_df = pd.concat([X_test, y_test], axis=1)

X_train = train_df.loc[:, train_df.columns != 'Passed']
y_train = train_df['Passed']  # 1-D target

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
text_transformer_cv =  Pipeline(steps=[
    ('cntvec',CountVectorizer())
]) 
 

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF',TfidfVectorizer())       
]) # TF-IDF
       
preprocessing = ColumnTransformer(
    transformers=
    [('category', categorical_encoder, categorical_columns),
     ('numeric', numerical_pipe, numerical_columns), # I think this is causing the error. But I do not know why not also categorical columns
     ('text',text_transformer_cv, text_columns)
])

clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)
   

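Best answer

The problem is in how the single text column is passed: a list such as ['Text'] selects a 2-D DataFrame, whereas CountVectorizer and TfidfVectorizer expect a 1-D sequence of documents. I hope a future version of scikit-learn will allow ['Text',], but until then pass the column name directly as a string: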

...

text_columns = 'Text' # instead of ['Text']

preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing,
            categorical_columns), 
        ('numeric', numeric_preprocessing, numerical_columns)
    ],
    remainder='passthrough'
)
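To see the string form working end to end, here is a minimal sketch on a three-row toy frame. The data values and the names `toy` and `features` are invented for illustration; only the transformers and the column names mirror the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the question's data (values are invented)
toy = pd.DataFrame({
    'Text': ['care packag church', 'busi group food support', 'holiday ice rink'],
    'Type': ['organisation', 'organisation', 'home'],
    'Age': [21.0, None, 98.0],
})

preprocessing = ColumnTransformer(transformers=[
    # A string selects a 1-D Series, i.e. one document per row
    ('text', CountVectorizer(), 'Text'),
    # A list selects a 2-D DataFrame, which these transformers expect
    ('category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
    ('numeric', SimpleImputer(strategy='mean'), ['Age']),
])

features = preprocessing.fit_transform(toy)

# One row per sample; columns = vocabulary size + one-hot levels + numeric columns
print(features.shape)
```

The rule of thumb this illustrates: use a plain string for transformers that consume a sequence of documents, and a list of names for transformers that consume a 2-D table.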

This question and its answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/66341086/
