python-3.x - Sklearn tf-idf TfidfVectorizer fails to capture one-letter words

Tags: python-3.x scikit-learn nlp tf-idf tfidfvectorizer

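A specific example is "Queens Stop 'N' Swap". After transforming it, I only get three features: ['Queens', 'Stop', 'Swap']. The 'N' has been ignored. How can I capture 'N'? All parameters in my code are left at their defaults.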

from sklearn.feature_extraction.text import TfidfVectorizer

# Create the vectorizer
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

Best answer

You don't get 'n' as a token because the default token pattern only matches tokens of two or more word characters, so a single letter is dropped:

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Queens Stop 'N' Swap"]
# This token_pattern is the default: it requires at least two word characters
tfidf = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b')
tfidf.fit(texts)
tfidf.vocabulary_
{'queens': 0, 'stop': 1, 'swap': 2}
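To see why the default pattern skips 'N', the regular expression can be tried on its own with Python's `re` module, outside of scikit-learn. This is only a minimal sketch: the variable names are illustrative, and it leaves out the lowercasing step that the vectorizer applies by default.

```python
import re

text = "Queens Stop 'N' Swap"

# Default scikit-learn token_pattern: two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters.
relaxed_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, text)
relaxed_tokens = re.findall(relaxed_pattern, text)

print(default_tokens)  # ['Queens', 'Stop', 'Swap']
print(relaxed_tokens)  # ['Queens', 'Stop', 'N', 'Swap']
```

The only difference between the two patterns is `\w\w+` versus `\w+`, which is what decides whether a one-letter token survives.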

To capture one-letter tokens and preserve case, change it to:

tfidf = TfidfVectorizer(token_pattern='(?u)\\b\\w+\\b',lowercase=False)
tfidf.fit(texts)
tfidf.vocabulary_
{'Queens': 1, 'Stop': 2, 'N': 0, 'Swap': 3}
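One way to check a custom `token_pattern` before fitting is to inspect the tokens the vectorizer would produce. The sketch below assumes `build_analyzer()` is an acceptable way to do that; the names `vectorizer`, `tokens`, and `features` are illustrative and not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Queens Stop 'N' Swap"]

# Same settings as above: keep one-letter tokens and preserve case.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)

# build_analyzer() returns the callable that preprocesses and tokenizes a string.
analyzer = vectorizer.build_analyzer()
tokens = analyzer(texts[0])
print(tokens)  # ['Queens', 'Stop', 'N', 'Swap']

# After fitting, list the features ordered by their column index.
matrix = vectorizer.fit_transform(texts)
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)  # ['N', 'Queens', 'Stop', 'Swap']
print(matrix.shape)  # (1, 4)
```

Feature indices follow the sorted order of the terms, which is why 'N' ends up at index 0.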

For this topic, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64613067/
