python - 如何在管道中重新采样文本(不平衡组)?

标签 python pipeline text-classification resampling oversampling

我正在尝试使用 MultinomialNB 进行一些文本分类,但我遇到了问题,因为我的数据不平衡。 (为简单起见,下面是一些示例数据。实际上,我的数据要大得多。)我正在尝试使用过采样对我的数据进行重新采样,并且理想情况下我希望将其构建到此管道中。

下面的管道在没有过度采样的情况下工作正常,但同样,在现实生活中我的数据需要它。这是非常不平衡的。

使用此当前代码,我不断收到错误消息:“TypeError:所有中间步骤都应该是转换器并实现拟合和转换。”

如何将 RandomOverSampler 构建到此管道中?

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'], 
    ['small fruits', 'grapes']]

df = pd.DataFrame(data,columns=['Description','Type'])  

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()), 
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))

最佳答案

您应该使用在 imblearn 中实现的管道包裹,不是来自 sklearn 的那个.例如,此代码运行良好:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline


data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
    ['small fruits', 'grapes']]

df = pd.DataFrame(data, columns=['Description','Type'])

X_train, X_test, y_train, y_test = train_test_split(df['Description'],
    df['Type'], random_state=0)

text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))

关于python - 如何在管道中重新采样文本(不平衡组)?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54118076/

相关文章:

python - 如何获得元素总和较高的列表

neural-network - 我怎样才能得到伯特预训练模型中最后一个变压器编码器的所有输出,而不仅仅是 cls token 输出?

url - 如何对URL进行分类? URL 的特点是什么?如何从 URL 中选择和提取特征

machine-learning - 无法训练我的 keras 模型 : (Data cardinality is ambiguous:)

python - 使用 selenium 进行网页抓取,单击具有相同/相似名称且输入文本忽略大小写的链接

python - Kivy:覆盖子类中的属性

python视频捕获循环

CUDA 核心管道

c# - SSIS - PipelineComponent 中的 ProcessInput 被多次调用

Azure DevOps Pipeline AZ CLI 批量上传不推送文件