python - Break up a random forest classification fit into pieces in Python?

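I have almost 900,000 rows of data that I want to run through scikit-learn's random forest classifier. The problem is that my computer freezes completely when I try to fit the model, so I would like to fit the model on 50,000 rows at a time, but I'm not sure whether that is possible.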

So the code I have now is

# This code freezes my computer
rfc.fit(X, Y)

# What I want is something like this
# (.iloc replaces .ix, which has been removed from pandas)
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
# ... and so on

Best answer

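Feel free to correct me if I'm wrong, but I assume that you are not using the latest version of scikit-learn (0.16.1 at the time of writing), that you are on a Windows machine, and that you are using n_jobs=-1 (or some combination of the three). So my suggestion is to first upgrade scikit-learn or set n_jobs=1, and then try fitting on the whole dataset.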
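
As a minimal sketch of that first suggestion, the snippet below prints the installed scikit-learn version and fits a forest with n_jobs=1. The synthetic data from make_classification and the tree count are assumptions for illustration only; they stand in for the real X and Y from the question.

```python
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Check which scikit-learn release is installed before deciding to upgrade.
print(sklearn.__version__)

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
X, Y = make_classification(n_samples=2000, n_features=10, random_state=0)

# n_jobs=1 builds the trees in a single process instead of using all cores.
rfc = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0)
rfc.fit(X, Y)
```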

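If that fails, take a look at the warm_start parameter. By setting it to True and gradually increasing n_estimators, you can add more trees that are fitted on subsets of your data: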

from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])

# Add another 100 estimators, fitted on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.iloc[50000:100000], Y.iloc[50000:100000])

# And so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[100000:150000], Y.iloc[100000:150000])
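
The same warm_start pattern can be written as a loop over chunks. This is only a sketch: the synthetic data, chunk_size and trees_per_chunk are hypothetical values chosen to keep the example small, and it assumes every chunk contains examples of every class, since trees fitted on chunks with different class sets may not combine cleanly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
X, Y = make_classification(n_samples=3000, n_features=10, random_state=0)

chunk_size = 1000       # hypothetical; the question uses 50,000
trees_per_chunk = 20    # hypothetical; the answer uses 100

clf = RandomForestClassifier(
    n_estimators=trees_per_chunk, warm_start=True, n_jobs=1, random_state=0
)

for i, start in enumerate(range(0, len(X), chunk_size)):
    stop = start + chunk_size
    # Raise the target tree count; with warm_start=True only the
    # newly requested trees are fitted, and only on this chunk.
    clf.set_params(n_estimators=(i + 1) * trees_per_chunk)
    clf.fit(X[start:stop], Y[start:stop])

n_trees = len(clf.estimators_)
```

Each tree in the resulting forest has seen only one chunk, so the result is not identical to a forest trained on the full dataset.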

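Another possibility is to fit a new classifier on each chunk and then simply average the predictions of all the classifiers, or to merge the trees into one big random forest, as described here.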
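
A rough sketch of the averaging variant follows. The helper predict_average is a hypothetical name, not part of scikit-learn, and the synthetic data and chunk size are assumptions for illustration. It also assumes every chunk contains all classes, so that the predict_proba columns of the individual forests line up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
X, Y = make_classification(n_samples=3000, n_features=10, random_state=0)
chunk_size = 1000  # hypothetical; the question uses 50,000

# Fit one independent forest per chunk.
forests = []
for start in range(0, len(X), chunk_size):
    rf = RandomForestClassifier(n_estimators=20, n_jobs=1, random_state=start)
    rf.fit(X[start:start + chunk_size], Y[start:start + chunk_size])
    forests.append(rf)


def predict_average(forests, X_new):
    """Average class probabilities across forests and pick the most likely class."""
    proba = np.mean([rf.predict_proba(X_new) for rf in forests], axis=0)
    return forests[0].classes_[np.argmax(proba, axis=1)]


pred = predict_average(forests, X[:10])
```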

This question and answer come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/30742727/
