machine-learning - Correct ratio of positive to negative training examples for training a random-forest-based binary classifier

Tags: machine-learning random-forest

I am aware that the related question Positives/negatives proportion in train set suggests that a 1:1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.

However, my question differs from that related question in that it concerns a random forest model, and also in the following two respects.

1) I have a large amount of training data available, and the main bottleneck to using more training examples is training iteration time. That is, I don't want to spend more than one night training a ranker, because I want to iterate quickly.

2) In practice, the classifier will likely see about 1 positive example for every 4 negative examples.

In this situation, should I train with more negative examples than positive ones, or with equal numbers of positive and negative examples?

Best Answer

See the section titled "Balancing prediction error" in the official random forests documentation: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

I have marked some parts in bold.

In summary, this suggests that your training and test data should either

  1. reflect the real data's 1:4 class ratio, or
  2. use a 1:1 mix, but then carefully tune the per-class weights as described below until the OOB error rate on the class you want (the smaller one) is reduced.

Hope this helps.
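Option 2 above can be sketched with scikit-learn, which exposes per-class weighting directly on the random forest (this is an assumption-laden illustration, not the answer's own code; the data here is hypothetical, at the 1:4 positive:negative ratio from the question):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: roughly 1 positive per 4 negatives, 5 features.
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # weights inversely proportional to class counts
    oob_score=True,           # out-of-bag error, as the Breiman docs use
    random_state=0,
)
clf.fit(X, y)
print(round(clf.oob_score_, 3))  # OOB accuracy on the weighted forest
```

Instead of `"balanced"`, `class_weight` also accepts an explicit dict such as `{0: 1, 1: 10}`, which matches the manual weight tuning the quoted documentation walks through.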

In some data sets, the prediction error between classes is highly unbalanced. Some classes have a low prediction error, others a high. This occurs usually when one class is much larger than another. Then random forests, trying to minimize overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common to have the actives outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (actives) will be very high.

The user can detect the imbalance by outputting the error rates for the individual classes. To illustrate, 20-dimensional synthetic data is used. Class 1 occurs in one spherical Gaussian, class 2 in another. A training set of 1000 class 1's and 50 class 2's is generated, together with a test set of 5000 class 1's and 250 class 2's.
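The synthetic setup described above can be reproduced with numpy as follows (a sketch under assumptions: the Breiman page does not give the Gaussian means, so the means here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20  # 20-dimensional synthetic data, as in the quoted text

# Class 1 and class 2 each drawn from a spherical (identity-covariance)
# Gaussian; the separation between means is an assumption.
X1 = rng.normal(loc=0.0, scale=1.0, size=(1000, dim))  # 1000 class 1's
X2 = rng.normal(loc=1.0, scale=1.0, size=(50, dim))    # 50 class 2's
X_train = np.vstack([X1, X2])
y_train = np.concatenate([np.ones(1000, dtype=int), np.full(50, 2)])

print(X_train.shape)                # (1050, 20)
print(np.bincount(y_train)[1:])    # [1000   50] -- 20:1 imbalance
```

The test set (5000 class 1's, 250 class 2's) would be generated the same way with the larger sizes.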

The final output of a forest of 500 trees on this data is (columns: trees, overall test error %, class 1 error %, class 2 error %):

500 3.7 0.0 78.4

There is a low overall test set error (3.73%) but class 2 has over 3/4 of its cases misclassified.

The error balancing can be done by setting different weights for the classes.

The higher the weight a class is given, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set weights to 1 on class 1, and 20 on class 2, and run again. The output is:

500 12.1 12.7 0.0

The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:

500 4.3 4.2 5.2

This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.

Note that in getting this balance, the overall error rate went up. This is the usual result - to get better balance, the overall error rate will be increased.
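The weight-tuning loop the documentation walks through can be sketched end to end in scikit-learn (an assumption-laden sketch, not the Fortran code behind the quoted numbers: the Gaussian means, the candidate weights, and the use of a held-out test set instead of OOB error are all choices made here for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
dim = 20

def make(n1, n2):
    """Two spherical Gaussians; the 0.7 mean shift is illustrative."""
    X = np.vstack([rng.normal(0.0, 1.0, (n1, dim)),
                   rng.normal(0.7, 1.0, (n2, dim))])
    y = np.concatenate([np.zeros(n1, dtype=int), np.ones(n2, dtype=int)])
    return X, y

X_tr, y_tr = make(1000, 50)    # imbalanced training set, as in the text
X_te, y_te = make(5000, 250)   # imbalanced test set

errs = {}
for w2 in (1, 10, 20):  # candidate weights for the minority class
    clf = RandomForestClassifier(
        n_estimators=200,
        class_weight={0: 1, 1: w2},
        random_state=0,
    ).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    err1 = np.mean(pred[y_te == 0] != 0)  # class 1 error rate
    err2 = np.mean(pred[y_te == 1] != 1)  # class 2 (minority) error rate
    errs[w2] = (err1, err2)
    print(f"w2={w2:2d}  class1 err={err1:.3f}  class2 err={err2:.3f}")
```

Raising the minority-class weight typically trades some class 1 error for a lower class 2 error, which is the trade-off the quoted passage describes.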

Regarding machine-learning - correct ratio of positive to negative training examples for training a random-forest-based binary classifier, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17905205/
