I am trying to understand the following code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)
When I print X to see what is returned, I get this result:
(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
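However, I don't understand what this result means.
Best answer
As @Himanshu wrote, each line has the form "(sentence_index, feature_index) count".
The feature_index is the column assigned to a word in the vocabulary learned by fit_transform (stored in vectorizer.vocabulary_), and the count is the number of times that word appears in the document.
For example: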
(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2    <- in this example, the count 2 means that the word "second" appears twice in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
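To check which word each feature_index refers to, the fitted vocabulary can be inverted and printed next to every non-zero entry. This is a minimal sketch using the same corpus as the question; the name index_to_word is only an illustrative helper, and the order of the printed lines may differ from the output above depending on the library version.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column (feature) index;
# invert it so a column index can be looked up directly.
# index_to_word is a hypothetical helper name, not part of scikit-learn.
index_to_word = {int(idx): word for word, idx in vectorizer.vocabulary_.items()}

# Walk over the non-zero entries of the sparse matrix in coordinate form.
coo = X.tocoo()
for row, col, count in zip(coo.row, coo.col, coo.data):
    print(f"({row}, {col}) {count}  -> {index_to_word[int(col)]!r}")
```

With this corpus, column 5 corresponds to "second", which is why the entry (1, 5) carries the count 2.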
Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence in the corpus list.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X)
(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 4    <- for the modified corpus, the count 4 means that the word "second" appears four times in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
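To see where these numbers come from, the counting can be reproduced by hand. The following is a simplified pure-Python sketch, not scikit-learn's actual implementation: it assumes the default settings (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically), and count_vectorize is a hypothetical helper name. The order of the triples may differ from what print(X) shows.

```python
import re
from collections import Counter

# Same token pattern as CountVectorizer's default: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_vectorize(corpus):
    """Return (vocabulary, triples) where each triple is
    (sentence_index, feature_index, count)."""
    # Lowercase and tokenize every document.
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]

    # Assign feature indices in alphabetical order of the unique words.
    unique_words = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {word: idx for idx, word in enumerate(unique_words)}

    # Count each word per document and record only the non-zero entries.
    triples = []
    for doc_idx, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            triples.append((doc_idx, vocabulary[word], count))
    return vocabulary, triples


corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vocabulary, triples = count_vectorize(corpus)
print(vocabulary)
for doc_idx, feature_idx, count in triples:
    print(f"({doc_idx}, {feature_idx}) {count}")
```

Here "second" gets feature index 5, and the second sentence (index 1) contains it four times, which gives the triple (1, 5, 4).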
Regarding "python - How does vectorizer fit_transform work in sklearn?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47898326/