python - Why am I getting a 'list' object is not callable error in Python?

I am working with this dataset [https://archive.ics.uci.edu/ml/datasets/Reuter_50_50] and trying to analyze its text features.

I read the files and store them in a variable named documents, as follows:

documents=author_labels(raw_data_dir)
documents.to_csv(documents_filename,index_label="document_id")
documents=pd.read_csv(documents_filename,index_col="document_id")
documents.head()

Next, I tried to generate tf-idf vectors with sublinear growth, storing the vectorizer in a variable named vectorizer:

vectorizer = TfidfVectorizer(input="filename",tokenizer=tokenizer,stop_words=stopwords_C50)
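For reference, sublinear tf scaling replaces a raw term count tf with 1 + log(tf). In scikit-learn it is switched on with sublinear_tf=True, which the call above does not set. The following is only a pure-Python sketch of the scaling itself; the function name sublinear_tf_weight is made up for illustration and is not part of scikit-learn.

```python
import math


def sublinear_tf_weight(count):
    """Return 1 + ln(count) for a positive raw term count, else 0.

    Hypothetical helper illustrating sublinear tf scaling; not sklearn code.
    """
    if count <= 0:
        return 0.0
    return 1.0 + math.log(count)


# Raw counts grow linearly; the scaled weights grow much more slowly.
for count in (1, 10, 100):
    print(count, round(sublinear_tf_weight(count), 3))
```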

Then I tried to generate the matrix X of tf-idf representations for every document in the corpus:

X = vectorizer.fit_transform(documents["filename"])

However, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-152-8c01204baf0e> in <module>
----> 1 X = vectorizer.fit_transform(documents["filename"])

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1611         """
   1612         self._check_params()
-> 1613         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1614         self._tfidf.fit(X)
   1615         # X is already a transformed view of raw_documents so

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1029 
   1030         vocabulary, X = self._count_vocab(raw_documents,
-> 1031                                           self.fixed_vocabulary_)
   1032 
   1033         if self.binary:

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    941         for doc in raw_documents:
    942             feature_counter = {}
--> 943             for feature in analyze(doc):
    944                 try:
    945                     feature_idx = vocabulary[feature]

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    327                                                tokenize)
    328             return lambda doc: self._word_ngrams(
--> 329                 tokenize(preprocess(self.decode(doc))), stop_words)
    330 
    331         else:

TypeError: 'list' object is not callable

How can I fix this?

Best answer

OK, I found the answer to my own question.

If I remove all the arguments from the vectorizer, like this:

vectorizer = TfidfVectorizer()

The code runs fine. Then I added the input argument back, and it still worked:

vectorizer = TfidfVectorizer(input="filename")

The same holds when I add the stop words back:

vectorizer = TfidfVectorizer(input="filename",stop_words=stopwords_C50)

However, as soon as I pass the tokenizer, the error is raised.
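The add-one-argument-at-a-time approach above can be sketched generically. This is a rough illustration only: ToyVectorizer is a made-up stand-in that, like the real vectorizer in the traceback, only calls its tokenizer when fit_transform runs, and find_failing_kwarg is a hypothetical helper, not a scikit-learn function.

```python
class ToyVectorizer:
    """Hypothetical stand-in for TfidfVectorizer: stores options, fails lazily."""

    def __init__(self, input="content", tokenizer=None, stop_words=None):
        self.input = input  # accepted but ignored in this toy version
        self.tokenizer = tokenizer if tokenizer is not None else str.split
        self.stop_words = set(stop_words or [])

    def fit_transform(self, docs):
        # The tokenizer is only *called* here, so a bad value fails late.
        return [
            [tok for tok in self.tokenizer(doc) if tok not in self.stop_words]
            for doc in docs
        ]


def find_failing_kwarg(factory, run, kwargs):
    """Add keyword arguments one at a time; return the first that makes run() fail."""
    applied = {}
    for name, value in kwargs.items():
        applied[name] = value
        try:
            run(factory(**applied))
        except Exception as exc:
            return name, exc
    return None, None


kwargs = {
    "input": "filename",
    "stop_words": ["the"],
    "tokenizer": ["already", "tokenized"],  # the mistake: a list, not a function
}
name, exc = find_failing_kwarg(
    ToyVectorizer, lambda v: v.fit_transform(["the cat sat"]), kwargs
)
print(name, "->", exc)
```

Because the arguments are applied cumulatively in insertion order, the first name returned is the one whose addition introduced the failure.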

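It turns out that the value I was passing as the tokenizer argument was a list of tokens, when it should have been a function. The vectorizer calls the tokenizer on each document (the tokenize(...) call in the last traceback frame), and calling a list raises TypeError: 'list' object is not callable.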
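The same error can be reproduced without scikit-learn. In this small sketch, apply_tokenizer is a hypothetical function that merely imitates the way a vectorizer invokes whatever it was given as tokenizer.

```python
def apply_tokenizer(tokenizer, text):
    """Imitate how a vectorizer uses its tokenizer: it simply calls it."""
    return tokenizer(text)


tokens = "the quick brown fox".split()  # a list of tokens, NOT a function

try:
    apply_tokenizer(tokens, "some text")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
# callable() tells the two cases apart before anything is called.
print(callable(tokens), callable(str.split))
```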

I defined a function stem_tokenizer, as follows:

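# Assumes NLTK: from nltk.stem import PorterStemmer; from nltk.tokenize import word_tokenize; porter_stemmer = PorterStemmer()
def stem_tokenizer(text):
    return [porter_stemmer.stem(token) for token in word_tokenize(text.lower())]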

And passed it to the vectorizer:

vectorizer = TfidfVectorizer(input="filename",tokenizer = stem_tokenizer, stop_words=stopwords_C50)

This solved my problem.
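A common way to end up with this mistake is to pass the result of calling a tokenizer (a list) instead of the function itself. A small guard can catch that before the vectorizer is built. This is only a sketch: check_tokenizer and simple_tokenizer are hypothetical names, and simple_tokenizer is a whitespace-splitting stand-in for stem_tokenizer so that the example runs without NLTK.

```python
def check_tokenizer(tokenizer):
    """Raise early if tokenizer is neither None nor callable (hypothetical helper)."""
    if tokenizer is not None and not callable(tokenizer):
        raise TypeError(
            "tokenizer must be a function taking a string, got "
            + type(tokenizer).__name__
        )
    return tokenizer


def simple_tokenizer(text):
    # Stand-in for stem_tokenizer: lowercase and split on whitespace.
    return text.lower().split()


check_tokenizer(simple_tokenizer)  # passes: the function itself
check_tokenizer(None)              # passes: None means "use the default"

try:
    # Wrong: this passes the *result* of the call, which is a list of tokens.
    check_tokenizer(simple_tokenizer("Some Text"))
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

With the guard in place, the failure surfaces where the vectorizer is configured rather than deep inside fit_transform.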

A similar question to "python - Why am I getting a 'list' object is not callable error in Python?" can be found on Stack Overflow: https://stackoverflow.com/questions/58141049/
