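I am trying to use LogisticRegression for text classification. I am using FeatureUnion for the features of a DataFrame and then cross_val_score to test the classifier's accuracy. However, I don't know how to include the free-text feature, named tweets, in the pipeline. I am using TfidfVectorizer for the bag-of-words model.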
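import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder
# DataFrameSelector is not part of scikit-learn; it is assumed to be a
# user-defined column selector whose definition is not shown here.
nominal_features = ["tweeter", "job", "country"]
numeric_features = ["age"]
numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(numeric_features)),
])
nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(nominal_features)),
    ("onehot", OneHotEncoder()),
])
text_pipeline = Pipeline([
    ("selector", DataFrameSelector("tweets")),
    ("vectorizer", TfidfVectorizer(stop_words='english')),
])
pipeline = Pipeline([
    ("union", FeatureUnion([
        ("numeric_pipeline", numeric_pipeline),
        ("nominal_pipeline", nominal_pipeline),
    ])),
    ("estimator", LogisticRegression()),
])
np.mean(cross_val_score(pipeline, df, y, scoring="accuracy", cv=5))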
Is this the correct way to include tweets, the free-text data, in the pipeline?
Best answer
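from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Bag-of-words counts -> TF-IDF weights -> Naive Bayes classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english', lowercase=True)),
    ('tfidf1', TfidfTransformer(use_idf=True, smooth_idf=True)),
    ('clf', MultinomialNB(alpha=1)),  # Laplace smoothing
])
# df is assumed to have a 'Text' column and a 'Target' label column
train, test = train_test_split(df, test_size=.3, random_state=42, shuffle=True)
pipeline.fit(train['Text'], train['Target'])
predictions = pipeline.predict(test['Text'])
print(test['Target'], predictions)
# pos_label is ignored unless average='binary', so it is omitted here
score = f1_score(test['Target'], predictions, average='micro')
print("Score of Naive Bayes is :", score)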
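The pipeline above fits on a single text column only. If the goal is to combine the tweets column with the numeric and nominal columns for LogisticRegression, one possible approach is scikit-learn's ColumnTransformer in place of FeatureUnion with a custom selector. This is only a sketch under assumptions: the toy DataFrame, its values, and the labels are invented for illustration, StandardScaler and max_iter=1000 are my own choices, and cv=2 is used only because the toy data is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented purely for illustration.
df = pd.DataFrame({
    "tweets": [
        "great service and friendly staff",
        "terrible delay and rude staff",
        "love the new phone camera",
        "hate the broken phone screen",
        "wonderful concert last night",
        "awful traffic this morning",
        "happy with the fast delivery",
        "angry about the late delivery",
    ],
    "tweeter": ["alice", "bob", "alice", "carol", "bob", "carol", "alice", "bob"],
    "job": ["dev", "chef", "dev", "nurse", "chef", "nurse", "dev", "chef"],
    "country": ["US", "UK", "US", "DE", "UK", "DE", "US", "UK"],
    "age": [25, 40, 31, 52, 38, 47, 29, 44],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

preprocess = ColumnTransformer([
    # Lists of column names select 2-D input for these transformers.
    ("numeric", StandardScaler(), ["age"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"),
     ["tweeter", "job", "country"]),
    # A bare string (not a list) selects a 1-D column of raw strings,
    # which is the input shape TfidfVectorizer expects.
    ("text", TfidfVectorizer(stop_words="english"), "tweets"),
])

pipeline = Pipeline([
    ("features", preprocess),
    ("estimator", LogisticRegression(max_iter=1000)),
])

# cv=2 only because the toy data has eight rows.
scores = cross_val_score(pipeline, df, y, scoring="accuracy", cv=2)
mean_accuracy = scores.mean()
print("Mean accuracy:", mean_accuracy)
```

The detail that matters most is the column spec for the text step: "tweets" as a plain string hands the vectorizer a 1-D sequence of documents, whereas ["tweets"] would hand it a 2-D frame, which TfidfVectorizer does not handle as a set of documents.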
Regarding "python - Text classification with LogisticRegression using a pipeline", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53468055/