machine-learning - Document classification using Naive Bayes

I have a question about the particular Naive Bayes algorithm used in document classification. Here is my understanding:

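1. Construct the probability of each word in the training set for each known classification
2. Given a document, extract all the words it contains
3. Multiply together the probabilities of those words being present in a classification
4. Perform (3) for each classification
5. Compare the results of (4) and choose the classification with the highest posterior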
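The five steps above can be sketched roughly as follows. This is only a minimal illustration, not a reference implementation: the names `train`, `classify`, and `TRAINING`, the toy documents, the add-one smoothing, the class prior, and the use of log probabilities (to avoid underflow when multiplying many small numbers) are all assumptions that do not come from the question. Word probabilities here are estimated from document counts.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    # Step 2: reduce a document to the set of distinct lowercase words it contains.
    return set(re.findall(r"[a-z]+", text.lower()))

def train(labeled_docs):
    # Step 1: for each class, count how many documents contain each word.
    doc_counts = defaultdict(int)
    word_doc_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_doc_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_doc_counts, vocab

def classify(text, model):
    doc_counts, word_doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    # Step 4: repeat the scoring for every class.
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior (an assumption)
        # Step 3: multiply word probabilities, done as a sum of logs.
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption) keeps the probability above zero.
            p = (word_doc_counts[label].get(word, 0) + 1) / (n_docs + 2)
            score += math.log(p)
        scores[label] = score
    # Step 5: pick the class with the highest score.
    return max(scores, key=scores.get), scores

# Hypothetical toy training set.
TRAINING = [
    ("I like bananas", "fruit"),
    ("bananas and apples are tasty", "fruit"),
    ("the engine needs oil", "cars"),
    ("I like fast cars", "cars"),
]

model = train(TRAINING)
label_a, _ = classify("bananas are tasty", model)
label_b, _ = classify("engine oil", model)
print(label_a, label_b)
```

Note that this follows the steps literally and only multiplies probabilities for words that are present; a full Bernoulli model would also account for vocabulary words that are absent from the document.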

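The part that confuses me is how to calculate the probability of each word given the training set. For example, take the word "banana": it appears in 100 documents in classification A, there are 200 documents in A in total, and 1000 words appear in A in total. To get the probability of "banana" appearing under classification A, should I use 100/200 = 0.5 or 100/1000 = 0.1?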
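To make the two candidates concrete, here is a small sketch using the numbers from the question. The function names are hypothetical. One thing worth noticing: 100/1000 divides a document count by a word count, so it is not quite a word-frequency estimate either; that would need the number of times "banana" occurs in A, which the question does not give.

```python
def document_level_estimate(docs_with_word, docs_in_class):
    # Fraction of the class's documents that contain the word at least once.
    return docs_with_word / docs_in_class

def token_level_estimate(word_occurrences, total_words_in_class):
    # Fraction of all word occurrences in the class that are this word.
    return word_occurrences / total_words_in_class

# Numbers from the question.
p_doc = document_level_estimate(100, 200)

# The second candidate as written in the question: a document count (100)
# divided by a word total (1000). A consistent token-level estimate would
# pass the number of "banana" occurrences instead, which is unknown here.
p_mixed = token_level_estimate(100, 1000)

print(p_doc, p_mixed)
```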

Best answer

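I believe your model will classify more accurately if you count the number of documents a word appears in, rather than the total number of times the word appears. In other words,

when classifying for "mentions fruit":

"I like bananas."

should be weighted no more and no less than

"Bananas! Bananas! Bananas! I like them."

So the answer to your question is 100/200 = 0.5.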
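A quick way to see the difference between the two views of those example sentences is sketched below. The helper names `presence_features` and `count_features` are made up for illustration, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

def words(text):
    # Lowercase the text and keep only alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())

def presence_features(text):
    # Presence/absence view: each distinct word counts once.
    return set(words(text))

def count_features(text):
    # Frequency view: repeated words count every time they occur.
    return Counter(words(text))

short_doc = "I like bananas."
long_doc = "Bananas! Bananas! Bananas! I like them."

# Under the presence view, "bananas" contributes equally in both documents.
in_short = "bananas" in presence_features(short_doc)
in_long = "bananas" in presence_features(long_doc)

# Under the frequency view, the second document counts it three times.
short_count = count_features(short_doc)["bananas"]
long_count = count_features(long_doc)["bananas"]

print(in_short, in_long, short_count, long_count)
```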

The description of document classification on Wikipedia also supports my conclusion:

Then the probability that a given document D contains all of the words W, given a class C, is

p(D | C) = ∏_i p(w_i | C)

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

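In other words, the document classification algorithm that Wikipedia describes tests how many of a classification's list of words a given document contains.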

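By the way, more advanced classification algorithms examine sequences of N words rather than each word in isolation, where N can be set according to how much CPU resource you are willing to spend on the computation.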
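For what it's worth, extracting sequences of N words (n-grams) from a tokenized document can be sketched as below. The function name `ngrams` and the sample tokens are illustrative assumptions; each n-gram would then be treated as a feature in the same way a single word is.

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    # Returns an empty list when the document has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "like", "ripe", "bananas"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
too_long = ngrams(tokens, 5)

print(bigrams)
```

Larger N captures more context, but the number of distinct features grows quickly, which is where the CPU (and memory) cost comes from.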

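Update

My direct experience is based on short documents. I want to highlight research that @BenAllison pointed out in the comments, which suggests my answer does not hold for longer documents. Specifically: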

One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.

A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529
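The point about term frequency in the quoted passage can be illustrated with a multinomial-style likelihood, in which each occurrence of a word contributes its own factor. This is a rough sketch under assumptions of my own, not code from the cited research: the function name, the add-one smoothing, and the toy counts are all hypothetical.

```python
import math
from collections import Counter

def multinomial_log_likelihood(doc_tokens, class_token_counts, vocab_size):
    # Log-likelihood of the document under a class, where word probabilities
    # come from token counts (with add-one smoothing, an assumption) and each
    # occurrence of a word adds another log p(word | class) term.
    total = sum(class_token_counts.values())
    log_likelihood = 0.0
    for word, count in Counter(doc_tokens).items():
        p = (class_token_counts.get(word, 0) + 1) / (total + vocab_size)
        log_likelihood += count * math.log(p)
    return log_likelihood

# Hypothetical token counts for two classes over a 4-word vocabulary.
fruit_counts = Counter({"banana": 30, "like": 10, "i": 10})
cars_counts = Counter({"engine": 30, "like": 10, "i": 10})
VOCAB_SIZE = 4

def evidence_for_fruit(doc_tokens):
    # Log-likelihood ratio: positive values favour the "fruit" class.
    return (multinomial_log_likelihood(doc_tokens, fruit_counts, VOCAB_SIZE)
            - multinomial_log_likelihood(doc_tokens, cars_counts, VOCAB_SIZE))

one_mention = evidence_for_fruit(["banana"])
five_mentions = evidence_for_fruit(["banana"] * 5)

print(one_mention, five_mentions)
```

Here five occurrences carry five times the log-evidence of one occurrence, whereas a presence/absence model would score both documents identically.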

Regarding machine-learning - Document classification using Naive Bayes, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13368118/
