machine-learning - nltk NaiveBayes classifier for text classification

Tags: machine-learning nlp nltk text-classification document-classification

In the code below, I know my NaiveBayes classifier works correctly because it behaves as expected on trainset1, but why does it fail on trainset2? I even tried two classifiers, one from TextBlob and one directly from nltk.

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk

trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
         ('hello i was there and no one came', 'class2'),
         ('all negative terms like sad angry etc', 'class2')]

def nltk_naivebayes(trainset, test_sentence):
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
    t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset]
    classifier = nltk.NaiveBayesClassifier.train(t)
    test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
    return classifier.classify(test_sent_features)

def textblob_naivebayes(trainset, test_sentence):
    cl = NaiveBayesClassifier(trainset)
    blob = TextBlob(test_sentence, classifier=cl)
    return blob.classify()

test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"

print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))

Output:

neg
class2
neg
class2

Even though test_sentence2 clearly belongs to class1.

Accepted Answer

I assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is more about understanding why it behaves this way in this particular case.

The likely reason is that the Naive Bayes classifier uses prior class probabilities: that is, the probability of neg vs. pos regardless of the text. In your case, 2/3 of the examples are negative, so the prior is 66% for the negative class and 33% for the positive class. The positive words in your single positive instance are "anniversary" and "soaring", which are unlikely to be enough to compensate for this prior class probability.
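The prior bias is easy to see directly. Here is a minimal sketch (using the same trainset2 as above) that computes the empirical class priors a Naive Bayes classifier would start from:

```python
from collections import Counter

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
             ('hello i was there and no one came', 'class2'),
             ('all negative terms like sad angry etc', 'class2')]

# Empirical class priors: P(class) = count(class) / total examples
labels = [label for _, label in trainset2]
counts = Counter(labels)
priors = {label: count / len(labels) for label, count in counts.items()}
print(priors)  # class1 ~ 0.33, class2 ~ 0.67
```

Before any word is even looked at, class2 already has twice the probability mass of class1, so the handful of class1-only words has to overcome a 2:1 head start.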

Note in particular that the computation of the word probabilities involves various "smoothing" functions (for example, each class might use log10(TermFrequency + 1) instead of log10(TermFrequency)) to prevent low-frequency words from having too large an impact on the classification result, to avoid division by zero, and so on. So the probabilities of "anniversary" and "soaring" are not 0.0 for the negative class and not 1.0 for the positive class, contrary to what you might have expected.
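As a concrete illustration of why unseen words do not get probability 0.0: nltk's NaiveBayesClassifier defaults to an expected-likelihood estimator (ELEProbDist), which effectively adds 0.5 to every count. The sketch below shows the arithmetic; the function name and the example numbers are hypothetical, chosen only to demonstrate the effect:

```python
def smoothed_prob(term_count, class_total, vocab_size, gamma=0.5):
    # Add-gamma (Lidstone) smoothing: every count gets gamma added,
    # so a word never seen in a class still has probability > 0
    return (term_count + gamma) / (class_total + gamma * vocab_size)

# "anniversary" never occurs in a class2 document (term_count=0),
# yet its smoothed probability under class2 is still positive:
p = smoothed_prob(term_count=0, class_total=10, vocab_size=20)
print(p)  # 0.025
```

This is why even words that appear only in the class1 example still contribute some probability mass to class2, and the strong class2 prior ends up winning.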

Regarding machine-learning - nltk NaiveBayes classifier for text classification, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39351735/
