Python NLTK stemmers never remove prefixes

Tags: python nlp nltk stemming porter-stemmer

I am trying to preprocess words to remove common prefixes such as "un" and "re", but all of NLTK's common stemmers seem to ignore prefixes entirely:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

PorterStemmer().stem('unhappy')
# u'unhappi'

SnowballStemmer('english').stem('unhappy')
# u'unhappi'

LancasterStemmer().stem('unhappy')
# 'unhappy'

PorterStemmer().stem('reactivate')
# u'reactiv'

SnowballStemmer('english').stem('reactivate')
# u'reactiv'

LancasterStemmer().stem('reactivate')
# 'react'

Isn't removing common prefixes and suffixes exactly what a stemmer is for? Is there another stemmer that can do this reliably?

Best Answer

You are right. Most stemmers only strip suffixes. In fact, Martin Porter's original paper is titled:

Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.

Probably the only stemmer in NLTK that does any prefix stemming is the Arabic stemmer.

But if we look at its prefix_replace function, it simply removes the old prefix and replaces it with the new one:

def prefix_replace(original, old, new):
    """
     Replaces the old prefix of the original string by a new suffix
    :param original: string
    :param old: string
    :param new: string
    :return: string
    """
    return new + original[len(old):]
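For example, with an empty replacement string it acts as a plain prefix remover (a trivial illustration reusing the function shown above):

prefix_replace('unhappy', 'un', '')
# 'happy'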

But we can do better!

First, do you have a fixed list of prefixes and their replacements for the language you need to handle?

Let's take the (unfortunately) de facto language, English, and do a little linguistic homework to work out which prefixes English has:

https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes

Without much work, you can write a prefix-stemming function that runs before NLTK's suffix stemming, e.g.

import re
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",     # e.g. anti-government, anti-racist, anti-war
    "auto": "",     # e.g. autobiography, automobile
    "de": "",       # e.g. de-classify, decontaminate, demotivate
    "dis": "",      # e.g. disagree, displeasure, disqualify
    "down": "",     # e.g. downgrade, downhearted
    "extra": "",    # e.g. extraordinary, extraterrestrial
    "hyper": "",    # e.g. hyperactive, hypertension
    "il": "",       # e.g. illegal
    "im": "",       # e.g. impossible
    "in": "",       # e.g. insecure
    "ir": "",       # e.g. irregular
    "inter": "",    # e.g. interactive, international
    "mega": "",     # e.g. megabyte, mega-deal, megaton
    "mid": "",      # e.g. midday, midnight, mid-October
    "mis": "",      # e.g. misaligned, mislead, misspelt
    "non": "",      # e.g. non-payment, non-smoking
    "over": "",     # e.g. overcook, overcharge, overrate
    "out": "",      # e.g. outdo, out-perform, outrun
    "post": "",     # e.g. post-election, post-war
    "pre": "",      # e.g. prehistoric, pre-war
    "pro": "",      # e.g. pro-communist, pro-democracy
    "re": "",       # e.g. reconsider, redo, rewrite
    "semi": "",     # e.g. semicircle, semi-retired
    "sub": "",      # e.g. submarine, sub-Saharan
    "super": "",    # e.g. super-hero, supermodel
    "tele": "",     # e.g. television, telepathic
    "trans": "",    # e.g. transatlantic, transfer
    "ultra": "",    # e.g. ultra-compact, ultrasound
    "un": "",       # e.g. unhappy, unusual, undo
    "up": "",       # e.g. upgrade, uphill
}

porter = PorterStemmer()

def stem_prefix(word, prefixes):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the number of substitutions made.
        # Only strip the prefix at the start of the word and allow
        # an optional dash between the prefix and the root.
        stripped, nsub = re.subn(r"^{}[\-]?".format(prefix), "", word)
        if nsub > 0:
            return stripped
    # No prefix matched; return the word unchanged so that
    # porter.stem() never receives None.
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes))


word = "extraordinary"
porter_english_plus(word)
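For reference, with the prefix table above this call should first strip "extra" and then Porter-stem the remainder, giving something like:

porter_english_plus("extraordinary")
# 'ordinari'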

Now that we have a simple prefix stemmer, can we do even better?

# E.g. this is not satisfactory:
>>> porter_english_plus("united")
"ited"

What if we check whether the prefix-stripped word appears in some word list before accepting it?

import re

from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",     # e.g. anti-government, anti-racist, anti-war
    "auto": "",     # e.g. autobiography, automobile
    "de": "",       # e.g. de-classify, decontaminate, demotivate
    "dis": "",      # e.g. disagree, displeasure, disqualify
    "down": "",     # e.g. downgrade, downhearted
    "extra": "",    # e.g. extraordinary, extraterrestrial
    "hyper": "",    # e.g. hyperactive, hypertension
    "il": "",       # e.g. illegal
    "im": "",       # e.g. impossible
    "in": "",       # e.g. insecure
    "ir": "",       # e.g. irregular
    "inter": "",    # e.g. interactive, international
    "mega": "",     # e.g. megabyte, mega-deal, megaton
    "mid": "",      # e.g. midday, midnight, mid-October
    "mis": "",      # e.g. misaligned, mislead, misspelt
    "non": "",      # e.g. non-payment, non-smoking
    "over": "",     # e.g. overcook, overcharge, overrate
    "out": "",      # e.g. outdo, out-perform, outrun
    "post": "",     # e.g. post-election, post-war
    "pre": "",      # e.g. prehistoric, pre-war
    "pro": "",      # e.g. pro-communist, pro-democracy
    "re": "",       # e.g. reconsider, redo, rewrite
    "semi": "",     # e.g. semicircle, semi-retired
    "sub": "",      # e.g. submarine, sub-Saharan
    "super": "",    # e.g. super-hero, supermodel
    "tele": "",     # e.g. television, telepathic
    "trans": "",    # e.g. transatlantic, transfer
    "ultra": "",    # e.g. ultra-compact, ultrasound
    "un": "",       # e.g. unhappy, unusual, undo
    "up": "",       # e.g. upgrade, uphill
}

porter = PorterStemmer()

whitelist = list(wn.words()) + words.words()

def stem_prefix(word, prefixes, roots):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the number of substitutions made.
        # Only strip the prefix at the start of the word and allow
        # an optional dash between the prefix and the root.
        stripped, nsub = re.subn(r"^{}[\-]?".format(prefix), "", word)
        # Keep the stripped form only if it is a known root word.
        if nsub > 0 and stripped in roots:
            return stripped
    # No acceptable prefix-stripped form found; keep the original word.
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes, whitelist))

This resolves the problem of prefix stripping producing nonsensical roots, e.g.

>>> stem_prefix("united", english_prefixes, whitelist)
"united"

But the Porter stemmer will still strip the suffix -ed, which may or may not be the desired output, especially when the goal is to keep linguistically sensible units in the data:

>>> porter_english_plus("united")
"unit"

So, depending on the task, it is sometimes more useful to use a lemmatizer rather than a stemmer.
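For instance, here is a minimal sketch using NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded, e.g. via nltk.download('wordnet')), which maps inflected forms back to dictionary entries instead of truncating them:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

wnl.lemmatize('united')           # 'united' (default POS is noun, so the form is kept)
wnl.lemmatize('united', pos='v')  # typically 'unite' (the verb reading maps to its lemma)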

See also:

Regarding "Python NLTK stemmers never remove prefixes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52140526/
