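python - Kneser-Ney smoothing of trigrams using Python NLTK

I am trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.

What I'm trying to do is this: I parse a text into a list of trigram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.

I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example: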

    import nltk

    ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!")

    freq_dist = nltk.FreqDist(ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)
    prob_sum = 0
    for i in kneser_ney.samples():
        prob_sum += kneser_ney.prob(i)
    print(prob_sum)

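The output is "41.51696428571428". Depending on the corpus size, this value grows without bound. That makes whatever prob() returns anything but a probability distribution.

Looking at the NLTK code, I would say the implementation is questionable. Maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint? In any other case: do you know of any working Python implementation? I don't really want to implement it myself.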

Best answer

I think you are misunderstanding what Kneser-Ney is computing.

From Wikipedia:

The normalizing constant λ_{w_{i-1}} has value chosen carefully to make the sum of conditional probabilities p_KN(w_i | w_{i-1}) equal to one.

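Of course, we're talking about bigrams here, but the same principle holds for higher-order models. Basically, what this quote means is that, for a fixed context w_{i-1} (or more context for higher-order models), the probabilities of all w_i must add up to one. When you add up the probabilities of all samples, you are including multiple contexts, which is why you end up with a "probability" greater than 1. If you keep the context fixed, as in the following code sample, you end up with a number <= 1.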



    import nltk
    from nltk.util import ngrams
    from nltk.corpus import gutenberg  # needs the corpus: nltk.download('gutenberg')

    # Pad each sentence so that trigrams at sentence boundaries are counted too.
    gut_ngrams = (
        ngram
        for sent in gutenberg.sents()
        for ngram in ngrams(sent, 3, pad_left=True, pad_right=True,
                            left_pad_symbol="BOS", right_pad_symbol="EOS")
    )
    freq_dist = nltk.FreqDist(gut_ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)

    # Sum the probabilities for one fixed context only: ("I", "confess", w_i).
    prob_sum = 0
    for i in kneser_ney.samples():
        if i[0] == "I" and i[1] == "confess":
            prob_sum += kneser_ney.prob(i)
            print("{0}:{1}".format(i, kneser_ney.prob(i)))
    print(prob_sum)

The output, based on the NLTK Gutenberg corpus subset, is as follows (recorded under Python 2, hence the u'' prefixes).



    (u'I', u'confess', u'.--'):0.00657894736842
    (u'I', u'confess', u'what'):0.00657894736842
    (u'I', u'confess', u'myself'):0.00657894736842
    (u'I', u'confess', u'also'):0.00657894736842
    (u'I', u'confess', u'there'):0.00657894736842
    (u'I', u'confess', u',"'):0.0328947368421
    (u'I', u'confess', u'that'):0.164473684211
    (u'I', u'confess', u'"--'):0.00657894736842
    (u'I', u'confess', u'it'):0.0328947368421
    (u'I', u'confess', u';'):0.00657894736842
    (u'I', u'confess', u','):0.269736842105
    (u'I', u'confess', u'I'):0.164473684211
    (u'I', u'confess', u'unto'):0.00657894736842
    (u'I', u'confess', u'is'):0.00657894736842
    0.723684210526

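The reason this sum (.72) is less than 1 is that the probability is calculated only over trigrams appearing in the corpus whose first word is "I" and whose second word is "confess". The remaining .28 of the probability mass is reserved for words w_i that do not follow "I" and "confess" in the corpus. This is the whole point of smoothing: to reallocate some probability mass from n-grams that appear in the corpus to those that don't, so that you don't end up with a bunch of n-grams with probability 0.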
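The .72 / .28 split can be reproduced by hand. The sketch below is only an illustration under two assumptions: that each seen trigram gets the absolutely discounted estimate (count - D) / count(context), and that D = 0.75 (which I believe is the default `discount` of `KneserNeyProbDist`). The counts are not read from the corpus; they are back-calculated from the probabilities printed above.

```python
# Assumed discount; believed to be the default of nltk.KneserNeyProbDist.
D = 0.75

# Counts inferred from the printed probabilities (an assumption):
# 9 continuations seen once, 2 seen twice, 2 seen seven times, 1 seen eleven times.
counts = [1] * 9 + [2, 2, 7, 7, 11]
context_total = sum(counts)  # occurrences of the context ("I", "confess")

# Discounted probability of each seen continuation.
seen_probs = [(c - D) / context_total for c in counts]
seen_mass = sum(seen_probs)

# Each of the 14 seen continuations gives up D counts to unseen ones.
reserved_mass = D * len(counts) / context_total

print(context_total)            # 38
print(round(seen_mass, 6))      # 0.723684
print(round(reserved_mass, 6))  # 0.276316
```

Under these assumptions the seen and reserved mass add up to exactly 1 for this one context, which is the normalization the Wikipedia quote describes.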

Also, doesn't

    ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!")

compute character trigrams? I think this needs to be tokenized to compute word trigrams.
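To make the character-versus-word point concrete, here is a small pure-Python sketch. `sliding_trigrams` is a hypothetical stand-in for `nltk.trigrams` (it is not NLTK's code), the whitespace split is a deliberately naive tokenizer, and the 0.75 discount is an assumption. It also groups the discounted estimates by context, which shows why summing over every sample exceeds 1 while each context on its own does not.

```python
from collections import Counter, defaultdict

text = "What a piece of work is man! how noble in reason!"

def sliding_trigrams(sequence):
    """Hypothetical stand-in for nltk.trigrams: every window of 3 consecutive items."""
    items = list(sequence)
    return list(zip(items, items[1:], items[2:]))

char_trigrams = sliding_trigrams(text)          # a str iterates character by character
word_trigrams = sliding_trigrams(text.split())  # naive whitespace tokenization

print(char_trigrams[0])  # ('W', 'h', 'a')
print(word_trigrams[0])  # ('What', 'a', 'piece')

# Discounted estimate for seen trigrams, with an assumed discount of 0.75.
D = 0.75
trigram_counts = Counter(word_trigrams)
context_counts = Counter()
for (w0, w1, _w2), count in trigram_counts.items():
    context_counts[(w0, w1)] += count

# Add up the seen probability mass separately for each context (w0, w1).
mass_per_context = defaultdict(float)
for (w0, w1, _w2), count in trigram_counts.items():
    mass_per_context[(w0, w1)] += (count - D) / context_counts[(w0, w1)]

total_over_all_contexts = sum(mass_per_context.values())
print(max(mass_per_context.values()))  # 0.25, no single context exceeds 1
print(total_over_all_contexts)         # 2.25, the sum across contexts does
```

With NLTK itself, the equivalent change would be to pass a token list (for example `text.split()` or the output of `nltk.word_tokenize`) to `nltk.trigrams` instead of the raw string.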

Regarding "python - Kneser-Ney smoothing of trigrams using Python NLTK", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35242155/
