machine-learning - Sub-sample size in scikit-learn RandomForestClassifier

Tags: machine-learning scikit-learn random-forest data-science

How can I control the size of the sub-sample used to train each tree in the forest? According to the scikit-learn documentation:

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

So bootstrap introduces randomness, but I cannot find a way to control the size of the sub-sample.

Best answer

Scikit-learn does not offer this, but you can easily get the option by using a (slower) combination of a decision tree and the bagging meta-classifier:

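from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree is fit on a sub-sample of 50% of the training set.
# The tree is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2.
clf = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5)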
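A slightly fuller sketch of the same idea, fit on a synthetic dataset. The dataset, the variable names and the parameter values are illustrative assumptions, not part of the answer. Setting max_features="sqrt" on the tree is also an assumption, added to mimic the per-split feature sampling of a random forest. The fitted ensemble exposes the drawn indices through estimators_samples_, which makes it possible to check the sub-sample size. Newer scikit-learn releases (0.22 and later) also accept max_samples directly on RandomForestClassifier when bootstrap=True, so the last lines only run on such versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Bagged trees: each tree sees 50% of the rows, drawn with replacement.
bag = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=20,
    max_samples=0.5,
    random_state=0,
)
bag.fit(X, y)

# Number of rows drawn for each tree (duplicates included).
subsample_sizes = [len(idx) for idx in bag.estimators_samples_]
print(subsample_sizes[:5])

# Alternative, assuming scikit-learn >= 0.22: max_samples on the forest itself.
forest = RandomForestClassifier(n_estimators=20, max_samples=0.5, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```

With 200 rows and max_samples=0.5, each tree should be trained on 100 drawn rows.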

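As a side note, Breiman's random forest indeed does not treat the sub-sample size as a parameter and relies entirely on the bootstrap, so each tree is built from approximately (1 - 1/e), or about 63%, of the distinct samples.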
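The (1 - 1/e) figure can be checked with a small simulation. The sketch below uses only the standard library; the helper name unique_fraction, the sample count and the number of repetitions are illustrative assumptions. It draws n indices with replacement from n and measures the fraction of distinct indices.

```python
import math
import random


def unique_fraction(n_samples, rng):
    """Draw n_samples indices with replacement; return the share of distinct ones."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples


rng = random.Random(0)
n_samples = 10_000

# Average over a few bootstrap draws to reduce noise.
fractions = [unique_fraction(n_samples, rng) for _ in range(20)]
mean_fraction = sum(fractions) / len(fractions)
expected = 1 - 1 / math.e

print(f"simulated: {mean_fraction:.4f}, 1 - 1/e: {expected:.4f}")
```

The chance that a given row is never picked in n draws is (1 - 1/n)^n, which tends to 1/e as n grows, so the simulated fraction should land close to 0.632.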

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/40847745/
