python - "Third order"Kneser-Key 平滑的正确实现(对于 Trigram 模型)

Tags: python nlp smoothing

In the code below, I try to compute the probability of a trigram using Kneser-Ney smoothing with a fixed discount. I have gone through the key references describing Kneser-Ney, by Chen & Goodman and by Dan Jurafsky. This [question](https://stats.stackexchange.com/questions/114863/in-kneser-ney-smoothing-how-are-unseen-words-handled) on Stack Exchange is a good summary of the bigram case.

I found it hard to derive an implementation of Kneser-Ney from its mathematical formulation for the trigram case, since the formulas are rather complex and hard to digest. After a long search I could not find a code walkthrough of the method.

I am assuming a closed vocabulary, and I want to check whether this code is a correct implementation.

Specifically, the function score_trigram(self, tri_g) takes a trigram as a tuple ('u','v','w') and tries to compute the log of its probability according to Kneser-Ney. The dictionaries shown in the __init__ method store the unigram, bigram and trigram frequencies learned from some corpus.

Assume these frequency counts are correctly initialized and given (a sketch of the counting logic appears after the class).

If we have a trigram (a,b,c), the high-level Kneser-Ney formulas for the case of a trigram with a non-zero count are:

P((a,b,c)) = P_ML_discounted((a,b,c)) + Total_discount_1 * P_KN((b,c))

P_ML_discounted((a,b,c)) = (count((a,b,c)) - discount) / count((a,b))

Total_discount_1 = discount * follow_up_count((a,b)) / count((a,b))

P_KN((b,c)) = continuation_count((b,c)) / count_of_unique_trigrams + Total_discount_2 * P_KN(c)

Total_discount_2 = discount + follow_up_count(b) / count_unique_bigrams

P_KN(c) = (continuation_count(c) - discount) / count_unique_bigrams + discount * 1 / vocabulary_size
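For example, with toy counts count((a,b,c)) = 2, count((a,b)) = 4, follow_up_count((a,b)) = 3 and discount = 0.75, the top level works out to P_ML_discounted((a,b,c)) = (2 - 0.75) / 4 = 0.3125 and Total_discount_1 = 0.75 * 3 / 4 = 0.5625, so P((a,b,c)) = 0.3125 + 0.5625 * P_KN((b,c)).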

I have two questions:

1. Are the preceding equations correct for the Kneser-Ney trigram case?

2. Is the corresponding scoring function implemented correctly in the code?

import collections
import math

class CustomLanguageModel:

def __init__(self, corpus):
    """Initialize your data structures in the constructor."""
    ### n-gram counts
    # trigram dict entry > ('word_a','word_b','word_c') : 10
    self.trigramCounts = collections.defaultdict(lambda: 0)

    # bigram dict entry > ('word_a','word_b') : 11
    self.bigramCounts = collections.defaultdict(lambda: 0)

    # unigram dict entry > 'word_a' : 15
    self.unigramCounts = collections.defaultdict(lambda: 0)

    ### Kneser-Ney (KN) counts

    '''The follow_up count of a bigram (a,b) is the number of unique trigrams
    that start with (a,b). For example, if the frequency of the trigram (a,b,c) is 3,
    this increments the follow_up count of (a,b) by one; likewise, if the frequency
    of (a,b,d) is 5, this adds one to the follow_up count of (a,b).'''
    # dict entry as >  ('word_a','word_b') : 7
    self.bigram_follow_up_dict = collections.defaultdict(lambda: 0)

    '''The continuation count of a bigram (y,z) is the number of unique trigrams
    that end with (y,z). For example, if the frequency of the trigram (x,y,z) is 3,
    this increments the continuation count of (y,z) by one; likewise, if the
    frequency of (r,y,z) is 5, this adds one to the continuation count of (y,z).'''
    # dict entry as > ('word_a','word_b') : 5
    self.bigram_continuation_dict = collections.defaultdict(lambda: 0)

    '''The continuation count of a unigram 'z' is the number of unique bigrams
    that end with 'z'. For example, if the frequency of the bigram ('y','z') is 3,
    this increments the continuation count of 'z' by one; likewise, if the frequency
    of ('w','z') is 5, this adds one to the continuation count of 'z'.'''
    # dict entry as >  'word_z' : 5
    self.unigram_continuation_count = collections.defaultdict(lambda: 0)

    '''The follow_up count of a unigram 'a' is the number of unique bigrams
    that start with 'a'. For example, if the frequency of the bigram ('a','b') is 3,
    this increments the follow_up count of 'a' by one; likewise, if the frequency
    of ('a','c') is 5, this adds one to the follow_up count of 'a'.'''
    # dict entry as >  'word_a' : 5
    self.unigram_follow_up_count = collections.defaultdict(lambda: 0)

    # total number of words, fixed discount
    self.total = 0
    self.d = 0.75
    self.train(corpus)

def train(self, corpus):
    # count and initialize the dictionaries (see the sketch after the class)
    pass
def score_trigram(self, tri_g):

    score = 0.0
    w1, w2, w3 = tri_g
    # use the trigram if it has a frequency > 0
    if self.trigramCounts[(w1,w2,w3)] > 0 and self.bigramCounts[(w1,w2)] > 0 :
        score += self.top_level_trigram_prob(*tri_g)
    # otherwise use the bigram (w2,w3) as an approximation
    else :
        if self.bigramCounts[(w2,w3)] > 0  and self.unigramCounts[w2]> 0:
            score = score + self.top_level_bigram_prob(w2,w3)
        ## otherwise use the unigram w3 as an approximation
        else:
            score += math.log(self.pkn_unigram(w3))               
    return score

def top_level_trigram_prob(self, w1, w2, w3):
    # discounted ML estimate of the trigram, per equation (2)
    term1 = max(self.trigramCounts[(w1, w2, w3)] - self.d, 0) / self.bigramCounts[(w1, w2)]
    # weight given to the backed-off bigram estimate, per equation (3)
    alfa = self.d * self.bigram_follow_up_dict[(w1, w2)] / self.bigramCounts[(w1, w2)]
    term2 = self.pkn_bigram(w2, w3)
    return math.log(term1 + alfa * term2)

def top_level_bigram_prob(self, w1, w2):
    # discounted ML estimate of the bigram
    term1 = max(self.bigramCounts[(w1, w2)] - self.d, 0) / self.unigramCounts[w1]
    # weight given to the backed-off unigram estimate
    alfa = self.d * self.unigram_follow_up_count[w1] / self.unigramCounts[w1]
    term2 = self.pkn_unigram(w2)
    return math.log(term1 + alfa * term2)

def pkn_bigram(self, w1, w2):
    return self.pkn_bigram_continuation(w1, w2) + self.pkn_bigram_follow_up(w1) * self.pkn_unigram(w2)


def pkn_bigram_continuation(self, w1, w2):
    # discounted continuation count of (w1, w2), cf. equation (4)
    ckn = self.bigram_continuation_dict[(w1, w2)]
    return max(ckn - self.d, 0) / len(self.bigram_continuation_dict)

def pkn_bigram_follow_up(self, w1):
    ckn = self.unigram_follow_up_count[w1]
    return self.d * ckn / len(self.bigramCounts)

def pkn_unigram(self, w1):
    # discounted continuation probability of w1, plus a uniform term (cf. equation (6))
    ckn = self.unigram_continuation_count[w1]
    return max(ckn - self.d, 0) / len(self.bigramCounts) + 1.0 / len(self.unigramCounts)
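For reference, here is a minimal sketch of the counting logic I assume in train(), where corpus is an iterable of tokenized sentences (lists of tokens); each type-count dictionary is incremented once per unique bigram or trigram, as described in the docstrings above:

def train(self, corpus):
    seen_bigrams, seen_trigrams = set(), set()
    for sentence in corpus:
        for w in sentence:
            self.unigramCounts[w] += 1
            self.total += 1
        for u, v in zip(sentence, sentence[1:]):
            self.bigramCounts[(u, v)] += 1
            if (u, v) not in seen_bigrams:
                seen_bigrams.add((u, v))
                self.unigram_follow_up_count[u] += 1      # unique bigrams starting with u
                self.unigram_continuation_count[v] += 1   # unique bigrams ending with v
        for u, v, w in zip(sentence, sentence[1:], sentence[2:]):
            self.trigramCounts[(u, v, w)] += 1
            if (u, v, w) not in seen_trigrams:
                seen_trigrams.add((u, v, w))
                self.bigram_follow_up_dict[(u, v)] += 1     # unique trigrams starting with (u, v)
                self.bigram_continuation_dict[(v, w)] += 1  # unique trigrams ending with (v, w)

With this in place, score_trigram(('a','b','c')) returns the log-probability under the backoff scheme above.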

Best answer

Let me answer your first question.

Below I have numbered your equations (I corrected the typo in your (5), and added the max(·, 0) in (2) and (6) based on your code):

(1) P((a,b,c)) = P_ML_discounted((a,b,c)) + Total_discount_1 * P_KN((b,c))

(2) P_ML_discounted((a,b,c)) = max(count((a,b,c)) - discount, 0) / count((a,b))

(3) Total_discount_1 = discount * follow_up_count((a,b)) / count((a,b))

(4) P_KN((b,c)) = continuation_count((b,c)) / count_of_unique_trigrams + Total_discount_2 * P_KN(c)

(5) Total_discount_2 = discount * follow_up_count(b) / count_unique_bigrams

(6) P_KN(c) = max(continuation_count(c) - discount, 0) / count_unique_bigrams + discount * 1 / vocabulary_size

On the correctness of the equations above:

(1)-(3): correct.

(4), (5): incorrect. In both equations the denominator (count_of_unique_trigrams in (4), count_unique_bigrams in (5)) should be replaced by the count of unique trigrams whose second word is b, i.e. the number of distinct trigrams of the form (*, b, *).

I can see in your code that pkn_bigram_continuation() does discount the continuation count of (b,c), which is correct; however, this is not reflected in your equation (4).
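Written out with both fixes, (4) and (5) would read:

(4') P_KN((b,c)) = max(continuation_count((b,c)) - discount, 0) / unique_trigram_count(*,b,*) + Total_discount_2 * P_KN(c)

(5') Total_discount_2 = discount * follow_up_count(b) / unique_trigram_count(*,b,*)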

(6) I believe you are using equation (4.37) from Dan Jurafsky's chapter. The problem is that the author is not clear about how to compute lambda(epsilon) so that the unigram probabilities are correctly normalized.

In fact, the unigram probability does not need to be discounted at all (see the slide titled "Kneser-Ney Details" on page 5 here), so (6) can simply be

P_KN(c) = continuation_count(c) / count_unique_bigrams.
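Putting these corrections together, a minimal sketch of the fixed lower-order estimates could look like the following. Note that trigram_middle_dict, which counts the number of distinct trigrams of the form (*, b, *) for each middle word b, is a hypothetical extra dictionary that your current code does not maintain:

def pkn_bigram(self, w1, w2):
    # denominator: number of distinct trigram types (*, w1, *) -- hypothetical dict
    n_mid = self.trigram_middle_dict[w1]
    term1 = max(self.bigram_continuation_dict[(w1, w2)] - self.d, 0) / n_mid
    lam = self.d * self.unigram_follow_up_count[w1] / n_mid
    return term1 + lam * self.pkn_unigram(w2)

def pkn_unigram(self, w1):
    # lowest level: plain continuation probability, no discounting needed
    return self.unigram_continuation_count[w1] / len(self.bigramCounts)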

Original question on Stack Overflow: https://stackoverflow.com/questions/36477499/
