python - How are the most informative feature ratios calculated in Naive Bayes in NLTK (Python)?

Tags: python nlp nltk logistic-regression

When we run the following command, we typically get a result like this:

 classifier.show_most_informative_features(10)

Result:

Most Informative Features
             outstanding = 1                 pos : neg    =     13.9 : 1.0
               insulting = 1                 neg : pos    =     13.7 : 1.0
              vulnerable = 1                 pos : neg    =     13.0 : 1.0
               ludicrous = 1                 neg : pos    =     12.6 : 1.0
             uninvolving = 1                 neg : pos    =     12.3 : 1.0
              astounding = 1                 pos : neg    =     11.7 : 1.0

Does anyone know how the values 13.9, 13.7, etc. are calculated?

Also, we can get the most informative features from Naive Bayes with classifier.show_most_informative_features(10), but if we want the same kind of output from logistic regression, could someone suggest a way to do that? I saw a post on Stack Overflow about this, but it requires vectors, which I am not using to create my features.

import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Original Naive Bayes accuracy percent: ", nltk.classify.accuracy(classifier, dev_set) * 100)
classifier.show_most_informative_features(10)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(train_set)
print("LogisticRegression accuracy percent: ", nltk.classify.accuracy(LogisticRegression_classifier, dev_set) * 100)

Best Answer

The most informative features of the Naive Bayes classifier in NLTK are documented as follows:

def most_informative_features(self, n=100):
    """
    Return a list of the 'most informative' features used by this
    classifier.  For the purpose of this function, the
    informativeness of a feature ``(fname,fval)`` is equal to the
    highest value of P(fname=fval|label), for any label, divided by
    the lowest value of P(fname=fval|label), for any label:
    |  max[ P(fname=fval|label1) / P(fname=fval|label2) ]
    """
    # The set of (fname, fval) pairs used by this classifier.
    features = set()
    # The max & min probability associated w/ each (fname, fval)
    # pair.  Maps (fname,fval) -> float.
    maxprob = defaultdict(lambda: 0.0)
    minprob = defaultdict(lambda: 1.0)

    for (label, fname), probdist in self._feature_probdist.items():
        for fval in probdist.samples():
            feature = (fname, fval)
            features.add(feature)
            p = probdist.prob(fval)
            maxprob[feature] = max(p, maxprob[feature])
            minprob[feature] = min(p, minprob[feature])
            if minprob[feature] == 0:
                features.discard(feature)

    # Convert features to a list, & sort it by how informative
    # features are.
    features = sorted(features,
                      key=lambda feature_:
                      minprob[feature_]/maxprob[feature_])
    return features[:n]

In the case of binary classification ('pos' vs. 'neg') where your features come from a unigram bag-of-words (BoW) model, the "informativeness" value returned by the most_informative_features() function for the word outstanding is equal to:

 p('outstanding'|'pos') / p('outstanding'|'neg')

The function iterates through all features (in the case of a unigram BoW model, the features are words) and then keeps the top 100 words with the highest "informativeness". An example of computing that ratio by hand is sketched below.
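To make the arithmetic concrete, here is a minimal sketch (assuming a trained nltk.NaiveBayesClassifier named classifier with the binary labels 'pos' and 'neg'; the helper name informativeness is hypothetical) that reproduces the ratio printed by show_most_informative_features() for a single feature:

def informativeness(classifier, fname, fval=1):
    # _feature_probdist maps (label, fname) -> smoothed P(fname=fval | label)
    p_pos = classifier._feature_probdist[('pos', fname)].prob(fval)
    p_neg = classifier._feature_probdist[('neg', fname)].prob(fval)
    # show_most_informative_features() prints the larger probability
    # divided by the smaller one, e.g. ~13.9 for ('outstanding', 1)
    return max(p_pos, p_neg) / min(p_pos, p_neg)

print(informativeness(classifier, 'outstanding'))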


The probability of a word given the label is computed in the train() function using Expected Likelihood Estimation from ELEProbDist, which under the hood is a LidstoneProbDist object with the gamma argument set to 0.5, and it does:

class LidstoneProbDist(ProbDistI):
    """
    The Lidstone estimate for the probability distribution of the
    experiment used to generate a frequency distribution.  The
    "Lidstone estimate" is parameterized by a real number *gamma*,
    which typically ranges from 0 to 1.  The Lidstone estimate
    approximates the probability of a sample with count *c* from an
    experiment with *N* outcomes and *B* bins as
    ``(c+gamma)/(N+B*gamma)``.  This is equivalent to adding
    *gamma* to the count for each bin, and taking the maximum
    likelihood estimate of the resulting frequency distribution.
    """

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/47017288/
