I'm trying to preprocess words to strip common prefixes such as "un" and "re", but all of NLTK's common stemmers seem to ignore prefixes entirely:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
PorterStemmer().stem('unhappy')
# u'unhappi'
SnowballStemmer('english').stem('unhappy')
# u'unhappi'
LancasterStemmer().stem('unhappy')
# 'unhappy'
PorterStemmer().stem('reactivate')
# u'reactiv'
SnowballStemmer('english').stem('reactivate')
# u'reactiv'
LancasterStemmer().stem('reactivate')
# 'react'
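Isn't it a stemmer's job to remove common prefixes as well as suffixes? Is there another stemmer that does this reliably?
Best answer
You're right. Most stemmers only strip suffixes. In fact, Martin Porter's original paper is titled: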
Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
And possibly the only stemmers in NLTK that do any prefix stemming are the Arabic stemmers:
- https://github.com/nltk/nltk/blob/develop/nltk/stem/arlstem.py#L115
- https://github.com/nltk/nltk/blob/develop/nltk/stem/snowball.py#L372
But if we look at this prefix_replace function, it simply removes the old prefix and puts a new one in its place.
def prefix_replace(original, old, new):
    """
    Replaces the old prefix of the original string by a new suffix
    :param original: string
    :param old: string
    :param new: string
    :return: string
    """
    return new + original[len(old):]
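To make the behaviour concrete, here is a small sketch that repeats the function from the listing above and calls it on a couple of words. The example words are my own, not taken from NLTK. Note that the function never checks that the string really starts with the old prefix; it just slices off len(old) characters.

```python
def prefix_replace(original, old, new):
    # Same body as the listing above: drop len(old) leading characters,
    # then prepend the new prefix.
    return new + original[len(old):]

# Stripping a prefix is just replacing it with the empty string.
print(prefix_replace("unhappy", "un", ""))    # happy

# Swapping one prefix for another.
print(prefix_replace("disagree", "dis", "re"))  # reagree

# No check is made that the word starts with the old prefix,
# so the caller has to test that first.
print(prefix_replace("happy", "un", ""))      # ppy
```

So on its own it is a building block, not a stemmer: deciding when a prefix should be removed is left entirely to the caller.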
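But we can do better!
First, do you have a fixed list of prefixes and replacements for the language you need to process?
Let's take English, the (unfortunately) de facto language, and do a bit of linguistics work to find out what prefixes English has: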
https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
Without much work, you can write a prefix-stemming function that runs before NLTK's suffix stemming, e.g.
import re
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",    # e.g. anti-government, anti-racist, anti-war
    "auto": "",    # e.g. autobiography, automobile
    "de": "",      # e.g. de-classify, decontaminate, demotivate
    "dis": "",     # e.g. disagree, displeasure, disqualify
    "down": "",    # e.g. downgrade, downhearted
    "extra": "",   # e.g. extraordinary, extraterrestrial
    "hyper": "",   # e.g. hyperactive, hypertension
    "il": "",      # e.g. illegal
    "im": "",      # e.g. impossible
    "in": "",      # e.g. insecure
    "ir": "",      # e.g. irregular
    "inter": "",   # e.g. interactive, international
    "mega": "",    # e.g. megabyte, mega-deal, megaton
    "mid": "",     # e.g. midday, midnight, mid-October
    "mis": "",     # e.g. misaligned, mislead, misspelt
    "non": "",     # e.g. non-payment, non-smoking
    "over": "",    # e.g. overcook, overcharge, overrate
    "out": "",     # e.g. outdo, out-perform, outrun
    "post": "",    # e.g. post-election, post-war
    "pre": "",     # e.g. prehistoric, pre-war
    "pro": "",     # e.g. pro-communist, pro-democracy
    "re": "",      # e.g. reconsider, redo, rewrite
    "semi": "",    # e.g. semicircle, semi-retired
    "sub": "",     # e.g. submarine, sub-Saharan
    "super": "",   # e.g. super-hero, supermodel
    "tele": "",    # e.g. television, telepathic
    "trans": "",   # e.g. transatlantic, transfer
    "ultra": "",   # e.g. ultra-compact, ultrasound
    "un": "",      # e.g. unhappy
    "up": "",      # e.g. upgrade, uphill
}

porter = PorterStemmer()

def stem_prefix(word, prefixes):
    # Try longer prefixes first so that e.g. "inter" wins over "in".
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor at the start of the word and allow a dash between prefix and root.
        word, nsub = re.subn(r"^{}-?".format(re.escape(prefix)), "", word)
        if nsub > 0:
            return word
    # No known prefix found: return the word unchanged.
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes))

word = "extraordinary"
porter_english_plus(word)
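A detail worth checking in any regex-based prefix stripper like the one above is whether the pattern is anchored to the start of the word. The sketch below uses only the standard re module; strip_anywhere and strip_prefix are hypothetical helper names for illustration, not part of NLTK. Without the "^" anchor, re.sub removes the prefix text wherever it occurs in the word.

```python
import re

def strip_anywhere(word, prefix):
    # Unanchored: every occurrence of the prefix text is removed.
    return re.sub(r"{}-?".format(re.escape(prefix)), "", word)

def strip_prefix(word, prefix):
    # Anchored with "^": only a match at the start of the word is removed.
    return re.sub(r"^{}-?".format(re.escape(prefix)), "", word)

print(strip_anywhere("prepare", "re"))  # ppa
print(strip_prefix("prepare", "re"))    # prepare
print(strip_prefix("re-write", "re"))   # write
```

re.escape is not strictly needed for purely alphabetic prefixes, but it keeps the pattern safe if the prefix list ever contains regex metacharacters.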
Now that we have a simplistic prefix stemmer, can we do better?
# E.g. this is not satisfactory:
>>> porter_english_plus("united")
"ited"
What if we check whether the prefix-stripped word appears in some word list before stemming it?
import re
from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",    # e.g. anti-government, anti-racist, anti-war
    "auto": "",    # e.g. autobiography, automobile
    "de": "",      # e.g. de-classify, decontaminate, demotivate
    "dis": "",     # e.g. disagree, displeasure, disqualify
    "down": "",    # e.g. downgrade, downhearted
    "extra": "",   # e.g. extraordinary, extraterrestrial
    "hyper": "",   # e.g. hyperactive, hypertension
    "il": "",      # e.g. illegal
    "im": "",      # e.g. impossible
    "in": "",      # e.g. insecure
    "ir": "",      # e.g. irregular
    "inter": "",   # e.g. interactive, international
    "mega": "",    # e.g. megabyte, mega-deal, megaton
    "mid": "",     # e.g. midday, midnight, mid-October
    "mis": "",     # e.g. misaligned, mislead, misspelt
    "non": "",     # e.g. non-payment, non-smoking
    "over": "",    # e.g. overcook, overcharge, overrate
    "out": "",     # e.g. outdo, out-perform, outrun
    "post": "",    # e.g. post-election, post-war
    "pre": "",     # e.g. prehistoric, pre-war
    "pro": "",     # e.g. pro-communist, pro-democracy
    "re": "",      # e.g. reconsider, redo, rewrite
    "semi": "",    # e.g. semicircle, semi-retired
    "sub": "",     # e.g. submarine, sub-Saharan
    "super": "",   # e.g. super-hero, supermodel
    "tele": "",    # e.g. television, telepathic
    "trans": "",   # e.g. transatlantic, transfer
    "ultra": "",   # e.g. ultra-compact, ultrasound
    "un": "",      # e.g. unhappy
    "up": "",      # e.g. upgrade, uphill
}

porter = PorterStemmer()
# Requires the NLTK "wordnet" and "words" corpora; a set makes lookups fast.
whitelist = set(wn.words()) | set(words.words())

def stem_prefix(word, prefixes, roots):
    # Try longer prefixes first so that e.g. "inter" wins over "in".
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor at the start of the word and allow a dash between prefix and root.
        candidate, nsub = re.subn(r"^{}-?".format(re.escape(prefix)), "", word)
        # Only accept the stripped form if it is a known word.
        if nsub > 0 and candidate in roots:
            return candidate
    # No prefix left a known root: keep the word as it was.
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes, whitelist))
This solves the problem of stripping a prefix and being left with a nonsensical root, e.g.
>>> stem_prefix("united", english_prefixes, whitelist)
"united"
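But the Porter stemmer will still remove the suffix -ed, which may or may not be the desired output, especially when the goal is to retain linguistically sound units in the data: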
>>> porter_english_plus("united")
"unit"
So, depending on the task, it is sometimes more useful to work with lemmas than with stems.
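To illustrate the difference, here is a minimal sketch of dictionary-based lemma lookup. The LEMMAS table and the lemmatize helper are hypothetical and hand-written for this example; a real pipeline would rely on a lexical resource instead (NLTK ships a WordNetLemmatizer in nltk.stem, for instance). The point is only that a lemma is looked up as a whole word, so it stays a valid word, while a stemmer chops characters off.

```python
# Hypothetical, hand-written lemma table standing in for a real lexical
# resource such as WordNet.
LEMMAS = {
    "united": "unite",
    "reactivated": "reactivate",
    "unhappier": "unhappy",
}

def lemmatize(word, table=LEMMAS):
    # Look the whole word up; fall back to the word itself if it is unknown.
    return table.get(word.lower(), word)

print(lemmatize("united"))       # unite
print(lemmatize("reactivated"))  # reactivate
print(lemmatize("table"))        # table
```

Compare this with the Porter output "unit" for "united" shown above: the lemma "unite" keeps the word's identity, which matters when downstream steps need real words.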
See also:
A similar question about Python NLTK stemmers never removing prefixes can be found on Stack Overflow: https://stackoverflow.com/questions/52140526/