Why does sklearn's CountVectorizer ignore the pronoun "I"?
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
<1x3 sparse matrix of type '<class 'numpy.int64'>'
ngram_vectorizer.get_feature_names()
['gave it', 'he gave', 'it to']
Best Answer
The default tokenizer only considers words of two or more characters. You can change this behavior by passing an appropriate token_pattern to your CountVectorizer.
The default pattern is (see the signature in the docs):
'token_pattern': u'(?u)\\b\\w\\w+\\b'
You can get a CountVectorizer that does not discard one-letter words by changing the default, for example:
from sklearn.feature_extraction.text import CountVectorizer
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
print(ngram_vectorizer.get_feature_names())
which gives:
['gave it', 'he gave', 'it to', 'to i']
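The difference between the two patterns can be seen directly with Python's standard re module, without involving sklearn at all (note that CountVectorizer lowercases its input before tokenizing by default); this is a minimal sketch of the tokenization step only:

```python
import re

# CountVectorizer lowercases input by default before tokenizing.
text = "HE GAVE IT TO I".lower()

# Default pattern: \w\w+ requires at least two word characters,
# so the single-letter token "i" is dropped.
default_tokens = re.findall(r"(?u)\b\w\w+\b", text)

# Relaxed pattern: \w+ matches one or more word characters,
# so "i" is kept.
relaxed_tokens = re.findall(r"(?u)\b\w+\b", text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

Since "i" never survives the default tokenization, no bigram containing it (such as 'to i') can appear among the features, which is exactly what the question observed.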
Regarding python - CountVectorizer ignoring 'I', we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33260505/