When we run the following command, we usually get output like this:
classifier.show_most_informative_features(10)
Result:
Most Informative Features
outstanding = 1 pos : neg = 13.9 : 1.0
insulting = 1 neg : pos = 13.7 : 1.0
vulnerable = 1 pos : neg = 13.0 : 1.0
ludicrous = 1 neg : pos = 12.6 : 1.0
uninvolving = 1 neg : pos = 12.3 : 1.0
astounding = 1 pos : neg = 11.7 : 1.0
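Does anyone know how 13.9, 13.7, etc. are calculated?
Also, with Naive Bayes we can get the most informative features by calling classifier.show_most_informative_features(10), but if we want the same result with logistic regression, could someone suggest a way to get it? I saw a post on Stack Overflow, but it requires a vectorizer, which I did not use to create my features.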
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

# train_set and dev_set are lists of (feature_dict, label) pairs built elsewhere.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Original Naive bayes accuracy percent: ", nltk.classify.accuracy(classifier, dev_set) * 100)
classifier.show_most_informative_features(10)
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(train_set)
print("LogisticRegression accuracy percent: ", nltk.classify.accuracy(LogisticRegression_classifier, dev_set) * 100)
Best answer
The most informative features for the Naive Bayes classifier in NLTK are documented as follows:
def most_informative_features(self, n=100):
    """
    Return a list of the 'most informative' features used by this
    classifier.  For the purpose of this function, the
    informativeness of a feature ``(fname,fval)`` is equal to the
    highest value of P(fname=fval|label), for any label, divided by
    the lowest value of P(fname=fval|label), for any label:

    |  max[ P(fname=fval|label1) / P(fname=fval|label2) ]
    """
    # The set of (fname, fval) pairs used by this classifier.
    features = set()
    # The max & min probability associated w/ each (fname, fval)
    # pair.  Maps (fname,fval) -> float.
    maxprob = defaultdict(lambda: 0.0)
    minprob = defaultdict(lambda: 1.0)

    for (label, fname), probdist in self._feature_probdist.items():
        for fval in probdist.samples():
            feature = (fname, fval)
            features.add(feature)
            p = probdist.prob(fval)
            maxprob[feature] = max(p, maxprob[feature])
            minprob[feature] = min(p, minprob[feature])
            if minprob[feature] == 0:
                features.discard(feature)

    # Convert features to a list, & sort it by how informative
    # features are.
    features = sorted(features,
                      key=lambda feature_:
                      minprob[feature_]/maxprob[feature_])
    return features[:n]
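For binary classification ('pos' vs 'neg') with features taken from a unigram bag-of-words (BoW) model, the "information value" that most_informative_features() computes for the word outstanding is equal to:
    p('outstanding'|'pos') / p('outstanding'|'neg')
This ratio is the 13.9 printed for outstanding in the output above.
The function iterates over all features (in a unigram BoW model, the features are words), and then returns the n features (100 by default) with the highest "information value".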
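As a rough illustration of that ranking, here is a minimal sketch in plain Python. It does not call NLTK; the dictionary feature_probs and the helper informativeness() are hypothetical names, and the probabilities are made up so that the ratios resemble the ones in the question.

```python
# Toy values of P(word=1 | label). These numbers are invented for
# illustration and are NOT taken from the movie review corpus.
feature_probs = {
    "outstanding": {"pos": 0.0695, "neg": 0.0050},
    "insulting": {"pos": 0.0040, "neg": 0.0548},
    "movie": {"pos": 0.60, "neg": 0.58},
}


def informativeness(probs_by_label):
    """Return (ratio, likeliest_label, least_likely_label) for one feature.

    The ratio is the highest P(feature|label) divided by the lowest one,
    which is the quantity described in the docstring quoted above.
    """
    hi = max(probs_by_label, key=probs_by_label.get)
    lo = min(probs_by_label, key=probs_by_label.get)
    return probs_by_label[hi] / probs_by_label[lo], hi, lo


# Rank the words from most to least informative.
ranked = sorted(
    feature_probs,
    key=lambda word: informativeness(feature_probs[word])[0],
    reverse=True,
)

for word in ranked:
    ratio, hi, lo = informativeness(feature_probs[word])
    print(f"{word:>12} = 1    {hi} : {lo} = {ratio:.1f} : 1.0")
```

A word such as movie, which is about equally likely under both labels, gets a ratio close to 1 and ends up at the bottom of the ranking.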
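The probability of a word given a label is computed in the train() function using the expected likelihood estimate from ELEProbDist. Under the hood this is a LidstoneProbDist object with the gamma parameter set to 0.5, which does the following: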
class LidstoneProbDist(ProbDistI):
    """
    The Lidstone estimate for the probability distribution of the
    experiment used to generate a frequency distribution.  The
    "Lidstone estimate" is parameterized by a real number *gamma*,
    which typically ranges from 0 to 1.  The Lidstone estimate
    approximates the probability of a sample with count *c* from an
    experiment with *N* outcomes and *B* bins as
    ``(c+gamma)/(N+B*gamma)``.  This is equivalent to adding
    *gamma* to the count for each bin, and taking the maximum
    likelihood estimate of the resulting frequency distribution.
    """
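To make the formula concrete, here is a small sketch that applies the Lidstone estimate with gamma = 0.5 to hypothetical counts and then takes the ratio between the two labels. The function lidstone_prob() is a stand-in written for this example, not NLTK's implementation, and the document counts and the choice of two bins (word present or absent) are assumptions.

```python
def lidstone_prob(count, total, bins, gamma=0.5):
    """Lidstone estimate (count + gamma) / (total + bins * gamma).

    With gamma=0.5 this is the expected likelihood estimate (ELE).
    """
    return (count + gamma) / (total + bins * gamma)


# Hypothetical training data: 1000 documents per label (assumed).
N_POS = 1000
N_NEG = 1000
BINS = 2  # assumed: the feature is either present or absent

# Assumed counts: the word occurs in 45 positive and 3 negative documents.
p_pos = lidstone_prob(45, N_POS, BINS)
p_neg = lidstone_prob(3, N_NEG, BINS)

ratio = p_pos / p_neg
print(f"pos : neg = {ratio:.1f} : 1.0")

# Smoothing keeps the estimate above zero even for an unseen word,
# so the ratio never divides by zero.
p_unseen = lidstone_prob(0, N_NEG, BINS)
print(p_unseen > 0)
```

With equal document counts per label the denominators cancel, so the ratio reduces to (45 + 0.5) / (3 + 0.5).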
Regarding "python - How are the most informative feature percentages calculated in naive Bayes nltk python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47017288/