python - How to vectorize a list of words in Python?

Tags: python machine-learning scikit-learn nlp

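I am trying to use the CountVectorizer module from scikit-learn. From what I have read, it seems it can be used on a list of sentences, for example:

['This is the first document.', 'This is the second document.', 'The third document.', 'Is this the first document?']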

However, is there a way to vectorize a collection of words given as lists, for example [['this', 'is', 'text', 'document', 'to', 'analyze'], ['and', 'this', 'is', 'the', 'second'], ['and', 'this', 'and', 'that', 'is', 'third']]?

I tried converting each list into a sentence with ' '.join(wordList), but I got an error:

TypeError: sequence item 13329: expected string or Unicode, generator found

when I try to run:

vectorizer = CountVectorizer(min_df=50)
ratings = vectorizer.fit_transform([' '.join(wordList)])

Thanks!

Best answer

I think you need to do this:

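# One string per document; result is a sparse matrix with columns corresponding to words
counts = vectorizer.fit_transform([' '.join(doc) for doc in wordList])
# Array with words corresponding to columns (get_feature_names() before scikit-learn 1.0)
words = vectorizer.get_feature_names_out()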

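Finally, to get back the words of the first document, ['this', 'is', 'text', 'document', 'to', 'analyze'] (as distinct words in vocabulary order, not in their original order):

sample_idx = 0  # index of the first document
sample_words = [words[i] for i, count in
                enumerate(counts.toarray()[sample_idx]) if count > 0]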
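
If the documents are already tokenized, another option is to skip the join and pass a callable as the analyzer, so that CountVectorizer uses each list as given. The sketch below assumes the same toy corpus as in the question; the names first_doc_words and recovered are made up for illustration. It also uses inverse_transform, which returns the words with a nonzero count for each document without building a dense array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumed for illustration): one list of tokens per document.
wordList = [
    ['this', 'is', 'text', 'document', 'to', 'analyze'],
    ['and', 'this', 'is', 'the', 'second'],
    ['and', 'this', 'and', 'that', 'is', 'third'],
]

# A callable analyzer receives each raw document (here, a list of tokens)
# and returns its features. Returning the list unchanged bypasses
# scikit-learn's own lowercasing and tokenization.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vectorizer.fit_transform(wordList)

# For each document, inverse_transform gives the array of words whose
# count is nonzero; sorting makes the order deterministic.
recovered = vectorizer.inverse_transform(counts)
first_doc_words = sorted(str(w) for w in recovered[0])

print(first_doc_words)
```

Because the analyzer is bypassed, any lowercasing, stop-word removal, or token filtering has to be done on the lists before they are passed in.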

A similar question about "python - How to vectorize a list of words in Python?" can be found on Stack Overflow: https://stackoverflow.com/questions/42731011/
