python - Using MultiLabelBinarizer on test data with labels not in the training set

Tags: python machine-learning scikit-learn

Given this simple multilabel classification example (taken from this question: use scikit-learn to classify into multiple categories)

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    ["new york"],
            ["new york"],["london"],["london"],["london"],["london"],
            ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["london"],["london"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]


lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)


print "Accuracy Score: ",accuracy_score(Y_test, predicted)

The code runs fine and prints the accuracy score, but if I change y_test_text to

y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

I get

Traceback (most recent call last):
  File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
     print "Accuracy Score: ",accuracy_score(Y_test, predicted)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
differing_labels = count_nonzero(y_true - y_pred, axis=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

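Notice the introduction of the "england" label, which is not in the training set. How do I use multilabel classification so that if a "test" label is introduced, I can still run some of the metrics? Or is that even possible?

EDIT: Thanks for the answers, everyone. I guess my question is more about how the scikit binarizer works or should work. Given my short sample code, I would also expect that if I changed y_test_text to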

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

it would work, since we have already fitted on that label, but in this case I get

ValueError: Can't handle mix of binary and multilabel-indicator

Best answer

Yes, you can, if you also "introduce" the new label in the training y set, like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    
                ["new york"],["new york"],["london"],["london"],         
                ["london"],["london"],["london"],["london"],
                ["new york","England"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]


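# Fix the label set up front so every binarized matrix has the same three columns.
lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
# Reuse the fitted binarizer for the test labels instead of fitting it again.
Y_test = lb.transform(y_test_text)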

print Y_test

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

Output:

Accuracy Score:  0.571428571429

The crucial part is:

y_train_text = [["new york"],["new york"],["new york"],
                ["new york"],["new york"],["new york"],
                ["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","England"],
                ["new york","london"]]

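because we inserted "England" there as well. That makes sense: how else could the classifier predict a label it has never seen before? This way we have created a three-label classification problem.

EDITED: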
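To make the shape problem concrete, here is a minimal sketch (not part of the original answer) of what happens when the binarizer is fitted separately on training and test labels. The shortened label lists and the variable names y_test_unseen and y_test_single are made up for illustration. Each call to fit_transform learns its own label set, so the number of columns follows whatever data it was given:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, hypothetical label lists in the spirit of the question.
y_train_text = [["new york"], ["london"], ["new york", "london"]]
y_test_unseen = [["new york"], ["england"], ["new york", "london"]]
y_test_single = [["new york"], ["new york"], ["new york"]]

# Fitting on the training labels yields one column per label seen during fit.
lb = MultiLabelBinarizer()
Y_train = lb.fit_transform(y_train_text)
train_classes = list(lb.classes_)

# Fitting again on test labels re-learns the label set from the test data,
# so the column count no longer matches the training matrix.
Y_unseen = MultiLabelBinarizer().fit_transform(y_test_unseen)
Y_single = MultiLabelBinarizer().fit_transform(y_test_single)

print(train_classes)
print(Y_train.shape, Y_unseen.shape, Y_single.shape)
```

The three matrices end up with two, three, and one column respectively. The three-column case lines up with the "inconsistent shapes" error, and the one-column case is presumably why accuracy_score reports a mix of binary and multilabel-indicator targets.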

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

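You have to pass the classes as an argument to MultiLabelBinarizer(), and then it works with any y_test_text.

For more on python - Using MultiLabelBinarizer on test data with labels not in the training set, see the similar question on Stack Overflow: https://stackoverflow.com/questions/31503874/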
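As a rough sketch of that idea (again not from the original answer), the snippet below fixes the label set with classes=, fits once, and then only calls transform on the test labels. The "paris" label is a hypothetical label outside the fixed set. It assumes a reasonably recent scikit-learn release, where transform ignores such labels and emits a warning; older releases may raise an error instead.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

y_train_text = [["new york"], ["london"], ["new york", "England"]]
# "paris" is a made-up label that is not in the fixed class list.
y_test_text = [["new york"], ["new york"], ["paris"]]

# The column order follows the order given in classes.
lb = MultiLabelBinarizer(classes=("new york", "london", "England"))
Y_train = lb.fit_transform(y_train_text)

# Assumption: recent scikit-learn versions warn about unknown labels
# and leave them out of the indicator matrix.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    Y_test = lb.transform(y_test_text)

print(Y_train.shape, Y_test.shape)
print(Y_test)
```

Both matrices keep the same three columns, so metrics such as accuracy_score can compare them with predictions from a classifier trained on Y_train. A row whose labels are all unknown becomes all zeros, which is worth keeping in mind when reading the scores.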
