python - What alternatives are there to stemming?

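Given a list of words like ['add', 'adds', 'adding', 'added', 'addition'], I want to stem all of them to the same word, 'add'. That means collapsing all the verb and noun forms of a word (but not its adjective and adverb forms) into one.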

I couldn't find any stemmer that does this. The closest I found is PorterStemmer, but it stems the list above to ['add', 'add', 'ad', 'ad', 'addit'].

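I am not very experienced with stemming techniques, so I'd like to ask: is there an available stemmer that does what I described above? If not, do you have any suggestions for how to achieve it?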

Many thanks.

Best answer

Lemmatization should produce better results than stemming (source):

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

NLTK supports lemmatization as part of the nltk.stem package:

import nltk  # the lemmatizer needs the WordNet data: nltk.download('wordnet')

l = nltk.stem.WordNetLemmatizer()
l.lemmatize('dogs')     # -> 'dog'
l.lemmatize('addition') # -> 'addition'

s = nltk.stem.snowball.EnglishStemmer()
s.stem('dogs')          # -> 'dog'
s.stem('addition')      # -> 'addit'

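If the lemmatizer does not recognize a word, it leaves it unchanged. One pitfall is that all words are treated as nouns by default. To override that, you have to set the pos parameter, which defaults to pos='n':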

s.stem('better')               # -> 'better'
l.lemmatize('better')          # -> 'better'
l.lemmatize('better', pos='a') # -> 'good'
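
Applied to the word list from the question, the pos parameter suggests one possible approach: try the verb reading first and fall back to the noun reading. The following is only a minimal sketch under that assumption, not a definitive solution. The names normalize_word, normalize_all and wordnet_lemmatize are hypothetical helpers introduced here for illustration; only WordNetLemmatizer.lemmatize comes from NLTK.

```python
from typing import Callable, Iterable, List

# A lemmatizer here is any callable taking (word, pos) and returning a lemma.
Lemmatize = Callable[[str, str], str]


def normalize_word(word: str, lemmatize: Lemmatize) -> str:
    """Return the verb lemma if one is found, otherwise the noun lemma."""
    for pos in ("v", "n"):  # verb reading first, then noun
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word  # unrecognized, or already a base form


def normalize_all(words: Iterable[str], lemmatize: Lemmatize) -> List[str]:
    """Normalize every word in the input, preserving order."""
    return [normalize_word(w, lemmatize) for w in words]


def wordnet_lemmatize(word: str, pos: str) -> str:
    """Hypothetical adapter around NLTK.

    Assumes nltk is installed and nltk.download('wordnet') has been run.
    """
    import nltk
    return nltk.stem.WordNetLemmatizer().lemmatize(word, pos=pos)
```

Calling normalize_all(['add', 'adds', 'adding', 'added', 'addition'], wordnet_lemmatize) should collapse the inflected verb forms to 'add'. 'addition' will most likely stay as it is, because the step from 'addition' to 'add' is derivational rather than inflectional, and per the definition quoted above lemmatization normally removes inflectional endings only. Mapping such nouns onto their verbs would need a separate step.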

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/15819339/
