python - Bug in a naive Bayes classifier

Tags: python machine-learning classification

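I'm a beginner in machine learning, and I'm trying to implement my first naive Bayes classifier myself to understand it better. I have the dataset from http://archive.ics.uci.edu/ml/datasets/Adult (US census data; the classes are "<=50K" and ">50K").

Here is my Python code: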

#!/usr/bin/python

import sys
import csv

words_stats = {} # {'word': {'class1': cnt, 'class2': cnt}}
words_cnt = 0

targets_stats = {} # {'class1': 3234, 'class2': 884} how many words in each class
class_stats = {} # {'class1': 7896, 'class2': 3034} how many lines in each class
items_cnt = 0

def train(dataset, targets):
    global words_stats, words_cnt, targets_stats, items_cnt, class_stats

    num = len(dataset)
    for item in xrange(num):
        class_stats[targets[item]] = class_stats.get(targets[item], 0) + 1

        for i in xrange(len(dataset[item])):
            word = dataset[item][i]
            if not words_stats.has_key(word):
                words_stats[word] = {}

            tgt = targets[item]

            cnt = words_stats[word].get(tgt, 0)
            words_stats[word][tgt] = cnt + 1

            targets_stats[tgt] = targets_stats.get(tgt, 0) + 1
            words_cnt += 1

    items_cnt = num

def classify(doc, tgt_set):
    global words_stats, words_cnt, targets_stats, items_cnt

    probs = {} # posterior for each class: P(c|W) = P(W|c) * P(c) / P(W)
    pc = {} # probability of the class in the document set, P(c)
    pwc = {} # probability of the word set within a particular class, P(W|c)
    pw = 1 # probability of the word set in the document set, P(W)

    for word in doc:
        if word not in words_stats:
            continue #dirty, very dirty 
        pw = pw * float(sum(words_stats[word].values())) / words_cnt

    for tgt in tgt_set:
        pc[tgt] = class_stats[tgt] / float(items_cnt)
        for word in doc:
            if word not in words_stats:
                continue #dirty, very dirty
            tgt_wrd_cnt = words_stats[word].get(tgt, 0)
            pwc[tgt] = pwc.get(tgt, 1) * float(tgt_wrd_cnt) / targets_stats[tgt]

        probs[tgt] = (pwc[tgt] * pc[tgt]) / pw

    l = sorted(probs.items(), key = lambda i: i[1], reverse=True)
    print probs
    return l[0][0]

def check_results(dataset, targets):
    num = len(dataset)
    tgt_set = set(targets)
    correct = 0
    incorrect = 0

    for item in xrange(num):
        res = classify(dataset[item], tgt_set)
        if res == targets[item]:
            correct = correct + 1
        else:
            incorrect = incorrect + 1

    print 'correct:', float(correct) / num, ' incorrect:', float(incorrect) / num

def load_data(fil):
    data = []
    tgts = []

    reader = csv.reader(fil)
    for line in reader:
        d = [x.strip() for x in line]
        if '?' in d:
            continue

        if not len(d):
            continue

        data.append(d[:-1])
        tgts.append(d[-1:][0])

    return data, tgts

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print './program train_data.txt test_data.txt'
        sys.exit(1)

    filename = sys.argv[1]
    fil = open(filename, 'r')
    data, tgt = load_data(fil)
    train(data, tgt)

    test_file = open(sys.argv[2], 'r')
    test_data, test_tgt = load_data(test_file)

    check_results(test_data, tgt)

It gives about 61% correct results. When I print the probabilities, I get the following:

{'<=50K': 0.07371606889800396, '>50K': 15.325378327213354}

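But if the classifier were correct, I would expect the two probabilities to sum to 1. At first I thought the problem was floating-point underflow and tried doing all the calculations in logarithms, but the results were similar. I understand that skipping some words affects accuracy, but the probabilities are simply wrong.

What am I doing wrong, or what am I misunderstanding?

For convenience, I've uploaded the dataset and the Python script here: https://dl.dropboxusercontent.com/u/36180992/adult.tar.gz

Thanks for your help.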

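Best answer

Naive Bayes doesn't compute a probability directly; it computes a "raw score" for each label, and the scores are compared with one another to classify the instance. In the question's code, pw is a product of marginal word frequencies, which is generally not the same quantity as the sum of P(W|c) * P(c) over the classes, so dividing by it doesn't make the scores sum to 1. The scores can easily be converted into "probabilities" in the range [0, 1]: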

total = sum(probs.itervalues())
for label, score in probs.iteritems():
    probs[label] = score / total
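
Dividing by the total works because, under the model, the sum of P(W|c) * P(c) over all classes is the evidence P(W), so no separate estimate of it is needed. Below is a minimal sketch of the same normalization done in log space, which also sidesteps the underflow concern raised in the question. The function name normalize_log_scores and the sample numbers are made up for illustration, and the code is written to run on both Python 2 and Python 3:

```python
import math

def normalize_log_scores(log_scores):
    """Turn per-class log joint scores log(P(W|c) * P(c)) into posteriors.

    log_scores: dict mapping class label -> log score (hypothetical input).
    """
    # Subtract the largest log score before exponentiating (the
    # log-sum-exp trick) so very negative values do not underflow to 0.0.
    m = max(log_scores.values())
    exp_shifted = dict((c, math.exp(s - m)) for c, s in log_scores.items())
    total = sum(exp_shifted.values())
    return dict((c, v / total) for c, v in exp_shifted.items())

# Illustrative values only, not taken from the Adult dataset.
log_scores = {'<=50K': -1050.0, '>50K': -1047.0}
posteriors = normalize_log_scores(log_scores)
```

Exponentiating -1050 directly would give 0.0 in double precision, whereas the shifted version keeps the ratio between the classes intact.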

However, keep in mind that this still doesn't represent a true probability, as noted in this answer:

naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one.
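
One way to see why this happens: naive Bayes multiplies one likelihood factor per feature, so correlated features get counted as independent evidence several times over. The toy sketch below uses made-up likelihoods (0.6 vs. 0.4) and equal priors, and the function name posterior_after_repeats is hypothetical; it shows how the normalized posterior drifts toward 1 as the same weak piece of evidence is repeated:

```python
def posterior_after_repeats(n, p_a=0.6, p_b=0.4, prior_a=0.5):
    """Posterior of class A after n identical features that each
    slightly favor A.

    p_a, p_b: hypothetical per-feature likelihoods P(w|A) and P(w|B).
    """
    score_a = prior_a * p_a ** n
    score_b = (1.0 - prior_a) * p_b ** n
    # Normalize the two raw scores so that they sum to 1.
    return score_a / (score_a + score_b)

# Posterior for class A after 1, 5, 20 and 50 repeats of the same feature.
posteriors = [posterior_after_repeats(n) for n in (1, 5, 20, 50)]
```

With a single feature the posterior is just 0.6, but after 50 repeats it is indistinguishable from 1 for practical purposes, even though no genuinely new information was added.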

Regarding python - Bug in a naive Bayes classifier, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19349567/
