python-3.x - What is the purpose of fit() in CountVectorizer?

Tags: python-3.x machine-learning scikit-learn

I have code for a naive Bayes spam classifier that uses a CountVectorizer, as shown below:

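from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer and learn the vocabulary from the training data only
vect = CountVectorizer(stop_words='english')
vect.fit(x_train)
vect.vocabulary_  # dict mapping each token to its column index

# Encode both sets with the vocabulary learned above
x_train_transformed = vect.transform(x_train)
x_test_transformed = vect.transform(x_test)
print(type(x_train_transformed))
print(x_train_transformed)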

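What is the purpose of fit() here? Why do we fit only on x_train and not on x_test, yet transform both x_train and x_test?

I know that CountVectorizer's transform method converts the data frame into a bag of words (as they say), but what does the fit() method do here?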

Best answer

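As the documentation states, the fit method "learns a vocabulary dictionary of all tokens in the raw documents". In other words, it builds a dictionary of tokens (by default, a token is a word of two or more alphanumeric characters, with whitespace and punctuation acting as separators) that maps each token to a column of the output matrix. Fitting on the training set and then transforming both the training and test sets guarantees that a given word is always mapped to the same column in both matrices. Words that appear only in the test set are not in the learned vocabulary, so transform simply ignores them.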
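The behavior described above can be checked with a minimal sketch. The toy sentences below are hypothetical stand-ins for x_train and x_test (they are not from the question), chosen so that none of the kept words are English stop words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the question's x_train / x_test
x_train = ["win free prize", "project meeting monday", "claim free prize"]
x_test = ["free lunch monday"]

vect = CountVectorizer(stop_words="english")
vect.fit(x_train)  # learns the token -> column index mapping from x_train only

vocab = vect.vocabulary_
print(sorted(vocab, key=vocab.get))  # tokens listed in column order

x_train_transformed = vect.transform(x_train)
x_test_transformed = vect.transform(x_test)

# Both matrices have one column per vocabulary entry
print(x_train_transformed.shape, x_test_transformed.shape)

# "free" and "monday" are counted in their training columns;
# "lunch" was never seen during fit, so it is dropped
print(x_test_transformed.toarray())
```

If the vectorizer were fitted again on x_test, it would learn a different vocabulary with different column indices, and a classifier trained on the x_train columns could no longer interpret the test matrix.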

This question, "python-3.x - What is the purpose of fit() in CountVectorizer?", corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/54973289/