python - Transforming prediction targets

Tags: python machine-learning scikit-learn multilabel-classification

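I have a dataset in which each observation can belong to several labels (multi-label classification).

I have run SVM classification on it and it works. (I am interested in the accuracy for each class, so I applied OneVsRestClassifier to each class separately, as you will see in the code.)

I want to see the predicted value for each item in the test data. In other words, I want to see which label the model predicted for each observation in the test sample.

For example, this is the data passed to the model for prediction: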

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,1
5,I have no idea when this will end.,1,0,0,0,0,0,1

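My model has already predicted labels for these rows, and I want to see the prediction mapped to each row.

I know this can be done with label binarization from the scikit-learn library.

The problem is that the input fit_transform expects, as its documentation describes, differs from the target data I prepared and passed to the SVM classifier, so I don't know how to make it work.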
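To make the mismatch concrete, here is a minimal sketch, assuming the seven category names used in the code of this question. `MultiLabelBinarizer` expects one collection of label names per sample as input, whereas the CSV already stores one 0/1 column per label. Its `inverse_transform` goes the other way and turns an indicator matrix back into label names. The two example rows are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Pin the column order to the CSV column order instead of letting
# the binarizer sort the label names alphabetically.
mlb = MultiLabelBinarizer(classes=categories)
mlb.fit([categories])

# Format that transform/fit_transform expects: label names per sample.
label_sets = [['ADR'], ['WD', 'others']]
indicator = mlb.transform(label_sets)

# Format already stored in the CSV: one 0/1 column per label.
y_rows = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
])

# inverse_transform maps each indicator row back to a tuple of label names.
recovered = mlb.inverse_transform(y_rows)

print(indicator)
print(recovered)
```

Because the targets are already in indicator form, only `inverse_transform` is needed to read label names off a predicted indicator matrix.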

Here is my code:

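import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

# stop_words (used below) is assumed to be defined elsewhere.
df = pd.read_csv("finalupdatedothers.csv")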
categories = ['ADR','WD','EF','INF','SSI','DI','others']

train,test = train_test_split(df,random_state=42,test_size=0.3,shuffle=True)
X_train = train.sentences
X_test = test.sentences

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

for category in categories:
    print('... Processing {} '.format(category))
    SVC_pipeline.fit(X_train, train[category])
    prediction = SVC_pipeline.predict(X_test)
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(test[category], prediction, average='weighted')))
    print("\n")

Thank you very much for your time.

Best answer

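This is what you want. All I did was map the prediction, a numpy array representing indices of class labels in the categories list, back to label names. Here is the full code.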

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']

train,test = train_test_split(df,random_state=42,test_size=0.3,shuffle=True)
X_train = train.sentences
X_test = test.sentences

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=[])),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])


for category in categories:
    print('... Processing {} '.format(category))
    SVC_pipeline.fit(X_train,train[category])
    prediction = SVC_pipeline.predict(X_test)
    print([{X_test.iloc[i]: categories[prediction[i]]} for i in range(len(prediction))])

    print('SVM Linear Test accuracy is {} '.format(accuracy_score(test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(test[category], prediction, average='weighted')))
    print("\n")

Here is the sample output:

... Processing ADR 
[{'extreme weight gain, short-term memory loss, hair loss.': 'ADR'}, {'I am detoxing from Lexapro now.': 'ADR'}]
SVM Linear Test accuracy is 0.5 
SVM Linear f1 measurement is 0.3333333333333333 


... Processing WD 
[{'extreme weight gain, short-term memory loss, hair loss.': 'ADR'}, {'I am detoxing from Lexapro now.': 'ADR'}]
SVM Linear Test accuracy is 1.0 
SVM Linear f1 measurement is 1.0 

I hope this helps.
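One caveat about the loop in this answer: each iteration fits a binary model for a single category, so `prediction` holds 0/1 values for that category, and indexing `categories` with it can only ever produce `'ADR'` or `'WD'`. An alternative, sketched here as an assumption rather than as the answerer's method, is to fit `OneVsRestClassifier` once on all label columns and read the label names off each predicted row. The helper name `labels_per_row` is hypothetical, and the toy data reuses the six sample rows from the question, restricted to the three labels that actually occur in them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Only the labels that take both values in the toy rows below.
label_names = ['ADR', 'WD', 'others']

def labels_per_row(indicator, names):
    """Turn each 0/1 row of an indicator matrix into a list of label names."""
    return [[name for name, flag in zip(names, row) if flag == 1]
            for row in np.asarray(indicator)]

# (sentence, [ADR, WD, others]) pairs copied from the question's sample data.
rows = [
    ("extreme weight gain, short-term memory loss, hair loss.", [1, 0, 0]),
    ("I am detoxing from Lexapro now.", [0, 0, 1]),
    ("I slowly cut my dosage over several months and took vitamin supplements to help.", [0, 0, 1]),
    ("I am now 10 days completely off and OMG is it rough.", [0, 0, 1]),
    ("I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.", [0, 1, 1]),
    ("I have no idea when this will end.", [1, 0, 1]),
]
sentences = [text for text, _ in rows]
Y = np.array([flags for _, flags in rows])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# Passing a 2-D indicator matrix makes OneVsRestClassifier fit one
# binary LinearSVC per column in a single call.
pipeline.fit(sentences, Y)

# predict returns an indicator matrix of shape (n_samples, n_labels).
predicted = pipeline.predict(sentences)

for text, labels in zip(sentences, labels_per_row(predicted, label_names)):
    print(text, '->', labels)
```

With the real dataset, the same idea would apply to `train[categories]` and `X_test`; per-label accuracy can still be computed column by column from the predicted matrix.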

Regarding python - Transforming prediction targets, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51853429/
