python - spaCy and scikit-learn vectorizer

Tags: python scikit-learn nlp spacy

I wrote a lemma tokenizer for scikit-learn using spaCy, based on their example, and it works fine standalone:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}
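
The filter inside `__call__` can be read in isolation. The sketch below applies the same condition to plain strings instead of spaCy tokens; `keep_lemma` is a hypothetical helper name, and the lemma list is made up for illustration rather than produced by spaCy.

```python
def keep_lemma(lemma):
    """Same condition as in LemmaTokenizer.__call__, applied to a plain string."""
    return len(lemma) > 1 or lemma.isalnum()


# Illustrative lemmas: the ones from the example sentence plus two edge cases.
lemmas = ['apple', 'and', 'orange', 'be', 'tasty', '.', 'a', '--']
kept = [lemma for lemma in lemmas if keep_lemma(lemma)]
print(kept)  # ['apple', 'and', 'orange', 'be', 'tasty', 'a', '--']
```

Single-character punctuation such as `.` is dropped, which matches the vocabulary printed above. Because of the `or`, any lemma longer than one character passes, including multi-character punctuation such as `--`.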

However, using it in GridSearchCV gives errors. Here is a self-contained example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['', '']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)

### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'

The error does not appear when I load spaCy outside the tokenizer's constructor, and GridSearchCV then runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

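But this means that every one of my n_jobs in GridSearchCV will access and call the same spacynlp object, which is shared among these jobs. That leaves two questions:

  1. Is the spacynlp object from spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
  2. Is this the correct way to call spaCy inside a tokenizer for scikit-learn?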


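Based on the comments on mbatchkarov's post, I tried running all my documents, held in a pandas Series, through spaCy once for tokenization and lemmatization and saving the result to disk first. Then I load the lemmatized spaCy Doc objects, extract a list of tokens for every document, and supply it as input to a pipeline consisting of a simplified TfidfVectorizer and a DecisionTreeClassifier. I run the pipeline with GridSearchCV and extract the best estimator and the respective parameters.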


import numpy as np
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import spacy
from spacy.tokens import DocBin
nlp = spacy.load("de_core_news_sm") # define your language model

# adjust attributes to your liking:
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)

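# df is assumed to be a pandas DataFrame whose 'articleDocument' column holds the raw texts
for doc in nlp.pipe(df['articleDocument'].str.lower()):
    doc_bin.add(doc)
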
# either save DocBin to a bytes object, or...
#bytes_data = doc_bin.to_bytes()

# save DocBin to a file on disk
file_name_spacy = 'output/preprocessed_documents.spacy'
doc_bin.to_disk(file_name_spacy)

#Load DocBin at later time or on different system from disc or bytes object
#doc_bin = DocBin().from_bytes(bytes_data)
doc_bin = DocBin().from_disk(file_name_spacy)

docs = list(doc_bin.get_docs(nlp.vocab))

tokenized_lemmatized_texts = [[token.lemma_ for token in doc 
                               if not token.is_stop and not token.is_punct and not token.is_space and not token.like_url and not token.like_email] 
                               for doc in docs]

# classifier to use
clf = tree.DecisionTreeClassifier()

# just some random target response
y = np.random.randint(2, size=len(docs))

vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=False, tokenizer=lambda x: x, max_features=3000)

pipeline = Pipeline([('vect', vectorizer), ('dectree', clf)])
parameters = {'dectree__max_depth':[4, 10]}
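gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
gs_clf.fit(tokenized_lemmatized_texts, y)

# best estimator and the parameters that produced it
print(gs_clf.best_estimator_)
print(gs_clf.best_params_)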
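
One caveat with `tokenizer=lambda x: x`: the standard `pickle` module cannot serialize a lambda, which matters if the fitted pipeline is later saved with `pickle`. A module-level function that does the same thing avoids this. The sketch below only demonstrates the pickling difference; `identity_tokenizer` and `can_pickle` are hypothetical names introduced for illustration.

```python
import pickle


def identity_tokenizer(tokens):
    """Return already-tokenized input unchanged (stand-in for `lambda x: x`)."""
    return tokens


def can_pickle(obj):
    """Report whether the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


print(can_pickle(lambda x: x))         # False
print(can_pickle(identity_tokenizer))  # True
```

With this in place the vectorizer would be built as `TfidfVectorizer(lowercase=False, tokenizer=identity_tokenizer, ...)`, keeping the other arguments as above.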

