python - How do I pass a preprocessor to TfidfVectorizer? - sklearn

Tags: python preprocessor scikit-learn

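How do I pass a preprocessor to TfidfVectorizer? I wrote a function that takes a string and returns a preprocessed string, then set the parameter to that function with "preprocessor=preprocess", but it doesn't work. I have searched many times and found no examples; it seems nobody uses this parameter.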

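I have a second question. Does the preprocessor parameter override the stop-word removal and lowercasing that can otherwise be done with the stop_words and lowercase parameters?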

Best answer

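You only need to define a function that takes a string as input and returns the preprocessed string. For example, a simple function that uppercases a string looks like this: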

def preProcess(s):
    return s.upper()
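
A preprocessor used in practice usually does more than change case. The sketch below is only an illustration: the name `clean_text` and the choice to drop digits are assumptions, not part of the original answer. What matters is the contract, which is one string in and one string out (not a list of tokens), because tokenization still happens afterwards inside the vectorizer.

```python
import re

def clean_text(text):
    """Hypothetical preprocessor: lowercase, drop digits, tidy whitespace."""
    text = text.lower()                  # replaces the built-in lowercasing
    text = re.sub(r"\d+", " ", text)     # replace runs of digits with a space
    text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace
    return text.strip()                  # must return a str, not a token list

cleaned = clean_text("Room 101 is EMPTY!")
print(cleaned)
```

Punctuation is left alone here because the default token pattern of the vectorizer already ignores it.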

Once the function is defined, pass it to the TfidfVectorizer object. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

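# Pass the function itself (no parentheses) as the preprocessor.
vectorizer = TfidfVectorizer(preprocessor=preProcess)
vectorizer.fit(corpus)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0).
list(vectorizer.get_feature_names_out())

Result:

['AND', 'DOCUMENT', 'FIRST', 'IS', 'ONE', 'SECOND', 'THE', 'THIRD', 'THIS']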

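The uppercased feature names also answer your follow-up question indirectly: lowercase defaults to True, yet the custom preprocessor replaces the built-in preprocessing step, so no lowercasing takes place. Stop-word removal is a different matter: it is applied to the tokens after tokenization, so stop_words still takes effect when a preprocessor is supplied. The documentation describes the parameter as follows: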

preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
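
Both effects can be checked with a small experiment. This is a sketch under stated assumptions: the two documents, the `keep_case` function and the one-word stop list are made up for illustration, and only documented `TfidfVectorizer` parameters are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The Cat sat.", "The DOG barked!"]

def keep_case(text):
    # Hypothetical preprocessor that returns the text unchanged.
    return text

# Built-in preprocessing: lowercase=True (the default) is applied.
default_vec = TfidfVectorizer().fit(docs)
default_terms = sorted(default_vec.vocabulary_)

# Custom preprocessor: it replaces the built-in step, so lowercase=True
# no longer has any effect and the original casing survives.
custom_vec = TfidfVectorizer(preprocessor=keep_case, lowercase=True).fit(docs)
custom_terms = sorted(custom_vec.vocabulary_)

# Stop words are filtered after tokenization, so they still apply
# when a preprocessor is supplied.
stop_vec = TfidfVectorizer(preprocessor=str.lower, stop_words=["the"]).fit(docs)
stop_terms = sorted(stop_vec.vocabulary_)

print(default_terms)
print(custom_terms)
print(stop_terms)
```

Note that stop words are compared against the preprocessed tokens, so a preprocessor that changes case (such as the uppercasing one above) needs a stop list in the same case for the filter to match.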

This answer comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/23850256/
