python - How does vectorizer fit_transform work in sklearn?

Tags: python machine-learning scikit-learn

I'm trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get this result:

(0, 1)  1

(0, 2)  1

(0, 6)  1

(0, 3)  1

(0, 8)  1

(1, 5)  2

(1, 1)  1

(1, 6)  1

(1, 3)  1

(1, 8)  1

(2, 4)  1

(2, 7)  1

(2, 0)  1

(2, 6)  1

(3, 1)  1

(3, 2)  1

(3, 6)  1

(3, 3)  1

(3, 8)  1

However, I don't understand what this result means.

Best answer

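As @Himanshu wrote, each line of the output has the form "(sentence_index, feature_index) count".

Here, the count is the number of times a word occurs in that document. The feature_index identifies the word: CountVectorizer lowercases the text and numbers the distinct words in alphabetical order, so index 5 in this corpus stands for "second".

For example,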

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 2    <-- in this example, the count "2" says that the word "second" appears twice in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1
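
To see which word each feature index stands for, one option is to invert the `vocabulary_` attribute of the fitted vectorizer. The following is a minimal sketch rather than the only way to do it; the names `index_to_word`, `decoded`, and `dense` are mine and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix, one row per document

# vocabulary_ maps each learned word to its column (feature) index.
# Inverting it lets us look a word up by its index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Walk the non-zero entries and spell out the word behind each feature index.
coo = X.tocoo()
decoded = [
    (int(row), index_to_word[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]
for row, word, count in decoded:
    print(f"document {row}: {word!r} occurs {count} time(s)")

# The dense view has one row per document and one column per vocabulary word.
dense = X.toarray()
print(dense)
```

In the dense view, row 1 and column 5 hold the same count of 2 that the sparse listing shows as "(1, 5) 2".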

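Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.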

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 4    <-- for the modified corpus, the count "4" says that the word "second" appears four times in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1
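
The two steps behind fit_transform can also be imitated in plain Python, which may make the output easier to read. This is only a rough sketch and not scikit-learn's actual implementation: the helper functions `tokenize`, `fit`, and `transform` are hypothetical, and the regular expression assumes CountVectorizer's default settings (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: mimic the default tokenization (words of 2+ characters, lowercased).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit(corpus):
    """'fit' step: learn the vocabulary, one column index per distinct word."""
    words = sorted({token for doc in corpus for token in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def transform(corpus, vocabulary):
    """'transform' step: return {(doc_index, feature_index): count}."""
    counts = {}
    for doc_index, doc in enumerate(corpus):
        for word, n in Counter(tokenize(doc)).items():
            if word in vocabulary:  # words not seen during fit are ignored
                counts[(doc_index, vocabulary[word])] = n
    return counts


corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vocabulary = fit(corpus)
counts = transform(corpus, vocabulary)

second_index = vocabulary['second']
print(second_index, counts[(1, second_index)])
```

With the modified corpus, "second" again lands on feature index 5 and its count in document 1 is 4, matching the "(1, 5) 4" entry above.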

This question, "python - How does vectorizer fit_transform work in sklearn?", comes from Stack Overflow: https://stackoverflow.com/questions/47898326/
