machine-learning - Correct ratio of positive to negative training examples for training a random-forest-based binary classifier

Tags: machine-learning random-forest

I am aware that the related question Positives/negatives proportion in train set suggests that a 1:1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.

However, my question differs from that related question in that it concerns a random forest model, and also in the following two respects.

1) I have a large amount of training data available, and the main bottleneck to using more training examples is training iteration time. That is, I don't want to spend more than one night training a ranker, because I want to iterate quickly.

2) In practice, the classifier will likely see about 1 positive example for every 4 negative examples.

In this situation, should I train with more negative examples than positive ones, or with equal numbers of positive and negative examples?

Best Answer

See the section titled "Balancing prediction error" in the official random forests documentation: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

I have marked some parts in bold.

In summary, this suggests that your training and test data should either

  1. reflect the real data's 1:4 class ratio, or
  2. use a 1:1 mix, but then carefully tune the per-class weights as described below until the OOB error rate on the class you want (the smaller one) is reduced.

Hope this helps.
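Option 2 above can be sketched with scikit-learn, which exposes per-class weighting directly on the random forest (this is an assumption-laden illustration, not the answer's own code; the data here is hypothetical, at the 1:4 positive:negative ratio from the question):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: roughly 1 positive per 4 negatives, 5 features.
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # weights inversely proportional to class counts
    oob_score=True,           # out-of-bag error, as the Breiman docs use
    random_state=0,
)
clf.fit(X, y)
print(round(clf.oob_score_, 3))  # OOB accuracy on the weighted forest
```

Instead of `"balanced"`, `class_weight` also accepts an explicit dict such as `{0: 1, 1: 10}`, which matches the manual weight tuning the quoted documentation walks through.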

In some data sets, the prediction error between classes is highly unbalanced. Some classes have a low prediction error, others a high. This occurs usually when one class is much larger than another. Then random forests, trying to minimize overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common to have the actives outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (actives) will be very high.

The user can detect the imbalance by outputting the error rates for the individual classes. To illustrate, 20-dimensional synthetic data is used. Class 1 occurs in one spherical Gaussian, class 2 in another. A training set of 1000 class 1's and 50 class 2's is generated, together with a test set of 5000 class 1's and 250 class 2's.
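The synthetic setup described above can be reproduced with numpy as follows (a sketch under assumptions: the Breiman page does not give the Gaussian means, so the means here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20  # 20-dimensional synthetic data, as in the quoted text

# Class 1 and class 2 each drawn from a spherical (identity-covariance)
# Gaussian; the separation between means is an assumption.
X1 = rng.normal(loc=0.0, scale=1.0, size=(1000, dim))  # 1000 class 1's
X2 = rng.normal(loc=1.0, scale=1.0, size=(50, dim))    # 50 class 2's
X_train = np.vstack([X1, X2])
y_train = np.concatenate([np.ones(1000, dtype=int), np.full(50, 2)])

print(X_train.shape)                # (1050, 20)
print(np.bincount(y_train)[1:])    # [1000   50] -- 20:1 imbalance
```

The test set (5000 class 1's, 250 class 2's) would be generated the same way with the larger sizes.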

The final output of a forest of 500 trees on this data is (columns: trees, overall test error %, class 1 error %, class 2 error %):

500 3.7 0.0 78.4

There is a low overall test set error (3.73%) but class 2 has over 3/4 of its cases misclassified.

The error balancing can be done by setting different weights for the classes.

The higher the weight a class is given, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set weights to 1 on class 1, and 20 on class 2, and run again. The output is:

500 12.1 12.7 0.0

The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:

500 4.3 4.2 5.2

This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.

Note that in getting this balance, the overall error rate went up. This is the usual result - to get better balance, the overall error rate will be increased.
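The weight-tuning loop the documentation walks through can be sketched end to end in scikit-learn (an assumption-laden sketch, not the Fortran code behind the quoted numbers: the Gaussian means, the candidate weights, and the use of a held-out test set instead of OOB error are all choices made here for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
dim = 20

def make(n1, n2):
    """Two spherical Gaussians; the 0.7 mean shift is illustrative."""
    X = np.vstack([rng.normal(0.0, 1.0, (n1, dim)),
                   rng.normal(0.7, 1.0, (n2, dim))])
    y = np.concatenate([np.zeros(n1, dtype=int), np.ones(n2, dtype=int)])
    return X, y

X_tr, y_tr = make(1000, 50)    # imbalanced training set, as in the text
X_te, y_te = make(5000, 250)   # imbalanced test set

errs = {}
for w2 in (1, 10, 20):  # candidate weights for the minority class
    clf = RandomForestClassifier(
        n_estimators=200,
        class_weight={0: 1, 1: w2},
        random_state=0,
    ).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    err1 = np.mean(pred[y_te == 0] != 0)  # class 1 error rate
    err2 = np.mean(pred[y_te == 1] != 1)  # class 2 (minority) error rate
    errs[w2] = (err1, err2)
    print(f"w2={w2:2d}  class1 err={err1:.3f}  class2 err={err2:.3f}")
```

Raising the minority-class weight typically trades some class 1 error for a lower class 2 error, which is the trade-off the quoted passage describes.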

Regarding machine-learning - correct ratio of positive to negative training examples for training a random-forest-based binary classifier, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17905205/
