python - spaCy is_stop doesn't identify stop words?

When I use spaCy to identify stop words, it doesn't work if I load the en_core_web_lg model, but it does work with en_core_web_sm. Is this a bug, or am I doing something wrong?

import spacy
nlp = spacy.load('en_core_web_lg')

doc = nlp(u'The cat ran over the hill and to my lap')

for word in doc:
    print(f' {word} | {word.is_stop}')

Result:

 The | False
 cat | False
 ran | False
 over | False
 the | False
 hill | False
 and | False
 to | False
 my | False
 lap | False

However, when I change this line to load en_core_web_sm instead, I get different results:

nlp = spacy.load('en_core_web_sm')

 The | False
 cat | False
 ran | False
 over | True
 the | True
 hill | False
 and | True
 to | True
 my | True
 lap | False

Best answer

The problem you are seeing is a documented bug. The suggested workaround is as follows:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_lg')

# Set the is_stop flag on the lowercase, capitalized, and uppercase
# form of every stop word in the vocabulary.
for word in STOP_WORDS:
    for w in (word, word[0].upper() + word[1:], word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True

doc = nlp(u'The cat ran over the hill and to my lap')

for word in doc:
    print('{} | {}'.format(word, word.is_stop))

Output

The | True
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
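
If patching the vocabulary is not desirable, a model-independent alternative is to compare each token's lowercased text against a stop-word set directly. The following is a minimal sketch and not part of the original answer: the helper name is_stop_word and the small DEMO_STOP_WORDS set are hypothetical, chosen so the snippet runs without spaCy installed. With spaCy, one would presumably pass spacy.lang.en.stop_words.STOP_WORDS as the set and token.text as the text.

```python
# Hypothetical stand-in for spacy.lang.en.stop_words.STOP_WORDS,
# reduced to a few entries so the sketch runs without spaCy.
DEMO_STOP_WORDS = {"the", "over", "and", "to", "my"}


def is_stop_word(text, stop_words=DEMO_STOP_WORDS):
    """Return True if `text` is in `stop_words`, ignoring case."""
    return text.lower() in stop_words


sentence = "The cat ran over the hill and to my lap"

# Pair every whitespace-separated word with its stop-word flag.
flags = [(word, is_stop_word(word)) for word in sentence.split()]

for word, flag in flags:
    print('{} | {}'.format(word, flag))
```

Because the comparison uses the lowercased text, "The" and "the" are treated alike, regardless of which model is loaded.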

Regarding "python - spaCy is_stop doesn't identify stop words?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52263757/
