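python - Loading pickled classifier data: Vocabulary not fitted Error

Tags: python scikit-learn classification

I have read all the related questions here, but could not find a working solution.

This is how I create my classifier: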

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))

tf = StemmedTfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df = 0, max_features=200000, stop_words = 'english')


def create_tfidf(f):
    docs = []
    targets = []
    with open(f, "r") as sentences_file:
        reader = csv.reader(sentences_file, delimiter=';')
        reader.next()
        for row in reader:
            docs.append(row[1])
            targets.append(row[0])

    tfidf_matrix = tf.fit_transform(docs)
    print tfidf_matrix.shape
    # print tf.get_feature_names()
    return tfidf_matrix, targets


X,y = create_tfidf("l0.csv")
clf = LinearSVC().fit(X,y)

_ = joblib.dump(clf, 'linearL0_3gram_100K.pkl', compress=9)

This part works and produces the .pkl file. I then try to use it in a different script:

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))

tf = StemmedTfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df = 0, max_features=200000, stop_words = 'english')


clf = joblib.load('linearL0_3gram_100K.pkl')

print clf
test = "My super elaborate test string to test predictions"
print test + clf.predict(tf.transform([test]))[0]

I get ValueError: Vocabulary wasn't fitted or is empty!

Edit, as requested: the error traceback

 File "classifier.py", line 27, in <module>
    print test + clf.predict(tf.transform([test]))[0]
  File "/home/ec2-user/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1313, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/home/ec2-user/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 850, in transform
    self._check_vocabulary()
  File "/home/ec2-user/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 271, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "/home/ec2-user/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: StemmedTfidfVectorizer - Vocabulary wasn't fitted.
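
The error can be reproduced in isolation, without any pickled file. The following is a minimal sketch, assuming a recent scikit-learn (where NotFittedError lives in sklearn.exceptions) and using made-up toy documents and the plain TfidfVectorizer rather than the stemmed subclass. It shows that a newly constructed vectorizer has no vocabulary_ attribute until it has been fitted, so calling transform() on it fails regardless of which classifier was loaded.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy documents, for illustration only.
train_docs = ["the cat sat on the mat", "dogs chase cats"]

# A vectorizer that has seen training data carries a learned vocabulary.
fitted = TfidfVectorizer()
fitted.fit(train_docs)
has_vocab_fitted = hasattr(fitted, "vocabulary_")

# A vectorizer that was only constructed has learned nothing yet.
fresh = TfidfVectorizer()
has_vocab_fresh = hasattr(fresh, "vocabulary_")

try:
    fresh.transform(["a new test string"])
    raised = False
except NotFittedError:
    raised = True

print(has_vocab_fitted, has_vocab_fresh, raised)
```

The second script above is in the same situation as `fresh`: it builds a new StemmedTfidfVectorizer with the same parameters, but the parameters do not include the vocabulary learned from l0.csv.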

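Best answer

OK, I solved this by using a pipeline, so that my vectorizer is saved in the .pkl file together with the classifier.

Here is what it looks like (it is also simpler):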

import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
# On newer scikit-learn releases, where sklearn.externals.joblib no longer
# exists, use "import joblib" instead.
from sklearn.externals import joblib
from sklearn.pipeline import Pipeline
import Stemmer

english_stemmer = Stemmer.Stemmer('en')


class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Stem every token produced by the stock analyzer.
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))


def create_tfidf(f):
    # Read the raw documents (column 1) and their labels (column 0).
    docs = []
    targets = []
    with open(f, "r") as sentences_file:
        reader = csv.reader(sentences_file, delimiter=';')
        next(reader)  # skip the first row (assumed to be a header)
        for row in reader:
            docs.append(row[1])
            targets.append(row[0])
    return docs, targets


docs, y = create_tfidf("l1.csv")
tf = StemmedTfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, max_features=200000, stop_words='english')
clf = LinearSVC()

# The pipeline fits the vectorizer and the classifier together,
# so both end up in the same pickled object.
vec_clf = Pipeline([('tfvec', tf), ('svm', clf)])
vec_clf.fit(docs, y)

_ = joblib.dump(vec_clf, 'linearL0_3gram_100K.pkl', compress=9)

And on the loading side:

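from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.externals import joblib
import Stemmer

english_stemmer = Stemmer.Stemmer('en')


# The custom class (and the stemmer it uses) must be defined here as well,
# so that unpickling can find them.
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: english_stemmer.stemWords(analyzer(doc))


clf = joblib.load('linearL0_3gram_100K.pkl')
test = ["My super elaborate test string to test predictions"]
# predict() takes a list of raw strings; the pipeline vectorizes them itself.
# test is a list, so index it before concatenating with the predicted label.
print(test[0] + " " + clf.predict(test)[0])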

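Important:

The transformer, tf, is part of the pipeline, so there is no need to declare a new vectorizer (that was the earlier point of failure, because the vectorizer needs the vocabulary learned from the training data), nor to call .transform() on the test string.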
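
The same round trip can be checked end to end on a toy example. The sketch below is only an illustration under stated assumptions: it uses the plain TfidfVectorizer instead of the stemmed subclass, the standalone joblib package instead of sklearn.externals.joblib, a temporary file with a hypothetical name, and made-up documents and labels. It shows that the restored pipeline still carries the fitted vocabulary and accepts raw strings directly.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy training data, for illustration only.
docs = [
    "the dog chased the ball",
    "my dog barks at the cat",
    "the server returned an error",
    "restart the server after the update",
]
labels = ["pets", "pets", "ops", "ops"]

# Vectorizer and classifier are fitted, and later pickled, as one object.
pipeline = Pipeline([("tfvec", TfidfVectorizer()), ("svm", LinearSVC())])
pipeline.fit(docs, labels)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(pipeline, path, compress=9)

# In a real setup this part would live in a separate script.
restored = joblib.load(path)
test = ["my dog loves chasing the ball"]

# The fitted vocabulary travelled with the pickle.
vocab_restored = hasattr(restored.named_steps["tfvec"], "vocabulary_")
# Raw strings go straight into predict(); no separate transform() call.
prediction = restored.predict(test)[0]
same_as_before = prediction == pipeline.predict(test)[0]

print(vocab_restored, same_as_before)
```

With a custom vectorizer class such as StemmedTfidfVectorizer, the loading script additionally has to define or import that class, as in the answer's second script.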

Source: the Stack Overflow question "python - Loading pickled classifier data: Vocabulary not fitted Error": https://stackoverflow.com/questions/31744519/
