python - 在标签不在训练集中的测试数据上使用 MultilabelBinarizer

给定这个简单的多标签分类示例(取自这个问题，use scikit-learn to classify into multiple categories)

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    ["new york"],
            ["new york"],["london"],["london"],["london"],["london"],
            ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["london"],["london"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]


lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)


print "Accuracy Score: ",accuracy_score(Y_test, predicted)

代码运行良好，并打印准确度分数，但是如果我将 y_test_text 更改为

y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

我明白了

Traceback (most recent call last):
  File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
     print "Accuracy Score: ",accuracy_score(Y_test, predicted)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
differing_labels = count_nonzero(y_true - y_pred, axis=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

请注意引入了不在训练集中的“england”标签。我如何使用多标签分类，以便在引入“测试”标签时，我仍然可以运行一些指标？或者这甚至可能吗？

编辑:感谢大家的回答，我想我的问题更多是关于 scikit 二值化器如何工作或应该如何工作。鉴于我的简短示例代码，如果我将 y_test_text 更改为

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

它会起作用——我的意思是我们已经适应了那个标签，但在这种情况下我明白了

ValueError: Can't handle mix of binary and multilabel-indicator

最佳答案

如果您也在训练 y 集中“引入”新标签，则可以，如下所示:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    
                ["new york"],["new york"],["london"],["london"],         
                ["london"],["london"],["london"],["london"],
                ["new york","England"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]


lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

print Y_test

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

输出:

Accuracy Score:  0.571428571429

关键部分是:

y_train_text = [["new york"],["new york"],["new york"],
                ["new york"],["new york"],["new york"],
                ["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","England"],
                ["new york","london"]]

我们也插入了“England”。这是有道理的，因为如果分类器以前没有看到它，其他方式如何预测分类器？所以我们以这种方式创建了一个三标签分类问题。

已编辑:

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

您必须将类作为 arg 传递给 MultiLabelBinarizer()，它可以与任何 y_test_text 一起使用。

关于python - 在标签不在训练集中的测试数据上使用 MultilabelBinarizer，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/31503874/

python - 在标签不在训练集中的测试数据上使用 MultilabelBinarizer

上一篇：python - 谁能给我解释一下 numpy.indices()？

下一篇：python - 编译并上传到pypicloud服务器后运行python包