Python nltk: incorrect sentence tokenization with custom abbreviations

Tags: python nlp nltk tokenize

I am using the nltk tokenize library to split English sentences. Many of my sentences contain abbreviations such as e.g. or eg., so I updated the tokenizer with these custom abbreviations. However, I ran into a strange tokenization behavior with one sentence:

import nltk

nltk.download("punkt")
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

extra_abbreviations = ['e.g', 'eg']  # punkt stores abbreviations lowercase, without the final period
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

line = 'Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. Karma, Tape)'

for s in sentence_tokenizer.tokenize(line):
    print(s)

# OUTPUT
# Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g.
# Karma, Tape)

As you can see, the tokenizer does not split at the first abbreviation (correct), but it does split at the second one (incorrect).

Strangely, if I replace the word Karma with anything else, it works fine.

import nltk

nltk.download("punkt")
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

extra_abbreviations = ['e.g', 'eg']
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

line = 'Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. SomethingElse, Tape)'

for s in sentence_tokenizer.tokenize(line):
    print(s)

# OUTPUT
# Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. SomethingElse, Tape)

Any idea why this happens?

Best Answer

You can see why punkt makes its sentence-break choices by using the debug_decisions method:

>>> for d in sentence_tokenizer.debug_decisions(line):
...     print(nltk.tokenize.punkt.format_debug_decision(d))
... 
Text: '(e.g. React,' (at offset 47)
Sentence break? None (default decision)
Collocation? False
'e.g.':
    known abbreviation: True
    is initial: False
'react':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: {'MID-UC', 'MID-LC'}

Text: '(e.g. Karma,' (at offset 80)
Sentence break? True (abbreviation + orthographic heuristic)
Collocation? False
'e.g.':
    known abbreviation: True
    is initial: False
'karma':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'MID-LC'}

This tells us that in the corpus used for training, 'react' appeared mid-sentence both capitalized and lowercase ('MID-UC' and 'MID-LC'), so punkt does not break before 'React' in your line. 'karma', however, only ever appeared lowercase mid-sentence, so punkt treats the capitalized 'Karma' as a likely sentence starter.
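The heuristic described above can be sketched as plain bit-flag logic. This is an illustrative simplification, not nltk's actual implementation; the flag values and the helper function name are assumptions made for the sketch:

```python
# Illustrative sketch of punkt's orthographic heuristic: each word type
# accumulates bit flags for the contexts it was seen in during training.
ORTHO_MID_UC = 1 << 2   # seen capitalized in the middle of a sentence
ORTHO_MID_LC = 1 << 5   # seen lowercase in the middle of a sentence

def looks_like_sentence_starter(ortho_context: int, token_is_capitalized: bool) -> bool:
    """A capitalized token whose type was never seen capitalized
    mid-sentence, but was seen lowercase there, suggests a break."""
    return (token_is_capitalized
            and not (ortho_context & ORTHO_MID_UC)
            and bool(ortho_context & ORTHO_MID_LC))

# 'react': trained contexts {'MID-UC', 'MID-LC'} -> no break before 'React'
assert looks_like_sentence_starter(ORTHO_MID_UC | ORTHO_MID_LC, True) is False
# 'karma': trained contexts {'MID-LC'} only -> heuristic says sentence starter
assert looks_like_sentence_starter(ORTHO_MID_LC, True) is True
```

This is why OR-ing the MID-UC flag into 'karma''s orthographic context, as shown in the hack further down, flips the decision.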

Note that this is consistent with the library's documentation:

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.

PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.

So, while a quick hack for this particular case is to tweak the private _params further, telling punkt that 'Karma' can also occur mid-sentence:

>>> sentence_tokenizer._params.ortho_context['karma'] |= nltk.tokenize.punkt._ORTHO_MID_UC
>>> sentence_tokenizer.tokenize(line)
['Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. Karma, Tape)']

A better approach may be to add extra training data from CVs that contain all of these library names:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
# tweak trainer hyper-parameters here if helpful
# (my_corpus_of_concatted_tech_cvs is a placeholder for your own training text)
trainer.train(my_corpus_of_concatted_tech_cvs)
sentence_tokenizer = PunktSentenceTokenizer(trainer.get_params())

The original question, "Python nltk incorrect sentence tokenization with custom abbreviations", can be found on Stack Overflow: https://stackoverflow.com/questions/60737849/
