python - Machine learning random forest

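I am trying to fit a random forest classifier on an imbalanced dataset using the scikit-learn Python library.

My goal is to get roughly equal recall and precision values, and to do so I am using the class_weight parameter of RandomForestClassifier.

When fitting the random forest with class_weight = {0:1, 1:1} (in other words, assuming the dataset is not imbalanced), I get:

Accuracy: 0.79 Precision: 0.63 Recall: 0.32 AUC: 0.74

When I change class_weight to {0:1, 1:10}, I get:

Accuracy: 0.79 Precision: 0.65 Recall: 0.29 AUC: 0.74

So the recall and precision values barely changed (even if I increase the weight from 10 to 100, the change is minimal).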

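Since X_train and X_test are both imbalanced in the same proportion (the dataset has more than 1 million rows), shouldn't I get very different recall and precision values when using class_weight = {0:1, 1:10}?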
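The comparison described in the question can be reproduced with a minimal sketch along the following lines. It assumes a synthetic imbalanced dataset from `make_classification` rather than the asker's data, and the helper name `evaluate_class_weight` is hypothetical, so the numbers it prints will not match the ones quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (assumption): about 80% class 0.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2],
                           flip_y=0.05, random_state=0)
# stratify=y keeps the same class ratio in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def evaluate_class_weight(class_weight):
    """Fit a forest with the given class_weight and return test-set metrics."""
    clf = RandomForestClassifier(n_estimators=50, class_weight=class_weight,
                                 random_state=0)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_proba),
    }


# Compare the three weightings mentioned in the question.
results = {}
for weights in ({0: 1, 1: 1}, {0: 1, 1: 10}, {0: 1, 1: 100}):
    results[str(weights)] = evaluate_class_weight(weights)
    print(weights, results[str(weights)])
```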

Best answer

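If you want to increase your model's recall, there is a faster way.

You can compute the precision-recall curve using sklearn.

This curve shows you the trade-off between your model's precision and recall.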

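This means that if you want to increase your model's recall, you can ask the random forest for the predicted probability of each class, add 0.1 to the probability of class 1, and subtract 0.1 from the probability of class 0. This will effectively increase your recall.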
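The probability shift described above can be sketched as follows. This is only an illustration on synthetic data from `make_classification` (an assumption, not the asker's dataset), and the variable names are hypothetical. For a binary problem, adding 0.1 to class 1 and subtracting 0.1 from class 0 is roughly the same as lowering the decision threshold on the class 1 probability from 0.5 to 0.4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 80% class 0, 20% class 1.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2],
                           flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Columns of predict_proba follow clf.classes_, which is [0, 1] here.
proba = clf.predict_proba(X_test)

# Default decision: pick the class with the larger probability.
default_pred = np.argmax(proba, axis=1)

# Shifted decision: subtract 0.1 from class 0 and add 0.1 to class 1.
shifted = proba + np.array([-0.1, 0.1])
shifted_pred = np.argmax(shifted, axis=1)

recall_default = recall_score(y_test, default_pred)
recall_shifted = recall_score(y_test, shifted_pred)
precision_default = precision_score(y_test, default_pred, zero_division=0)
precision_shifted = precision_score(y_test, shifted_pred, zero_division=0)

print("default: precision=%.3f recall=%.3f" % (precision_default, recall_default))
print("shifted: precision=%.3f recall=%.3f" % (precision_shifted, recall_shifted))
```

Every sample predicted as class 1 before the shift is still predicted as class 1 after it, so recall cannot go down; precision typically drops in exchange.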

If you plot the precision-recall curve, you will be able to find the threshold at which precision and recall are equal.
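One way to locate that threshold numerically, instead of reading it off the plot, is sketched below. It assumes synthetic data from `make_classification` and a random forest scored with `predict_proba`; names such as `best_threshold` are hypothetical. It simply picks the point on the curve where the gap between precision and recall is smallest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 80% class 0, 20% class 1.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2],
                           flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of class 1

precision, recall, thresholds = precision_recall_curve(y_test, scores)

# precision and recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so it is dropped here.
gap = np.abs(precision[:-1] - recall[:-1])
idx = int(np.argmin(gap))
best_threshold = thresholds[idx]

# Apply the chosen threshold: predict class 1 when score >= threshold.
balanced_pred = (scores >= best_threshold).astype(int)
p = precision_score(y_test, balanced_pred)
r = recall_score(y_test, balanced_pred)
print("threshold=%.3f precision=%.3f recall=%.3f" % (best_threshold, p, r))
```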

Here is an example from sklearn:

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Standard-library replacement for the old sklearn.utils.fixes.signature helper
from inspect import signature

precision, recall, _ = precision_recall_curve(y_test, y_score)

# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
step_kwargs = ({'step': 'post'}
               if 'step' in signature(plt.fill_between).parameters
               else {})
plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.show()  # display the figure when run as a script

This should give you a precision-recall curve plot.

Regarding python - machine learning random forest, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53091838/
