python - SHAP TreeExplainer for a multiclass RandomForest: what is shap_values[i]?

I am trying to plot SHAP values.
This is my code, where rnd_clf is a RandomForestClassifier:

import shap 
explainer = shap.TreeExplainer(rnd_clf) 
shap_values = explainer.shap_values(X) 
shap.summary_plot(shap_values[1], X)

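I understand that shap_values[0] is for the negative class and shap_values[1] is for the positive class.
But what about a multiclass RandomForestClassifier? My rnd_clf classifies each sample as one of: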

['Gusto','Kestrel 200 SCI Older Road Bike', 'Vilano Aluminum Road Bike 21 Speed Shimano', 'Fixie'].

How do I determine which index of shap_values[i] corresponds to which class of my output?

Best answer

How do I determine which index of shap_values[i] corresponds to which class of my output?

shap_values[i] holds the SHAP values for the i-th class. Which class is the i-th one depends on the encoding scheme you use: LabelEncoder, pd.factorize, and so on.
You can try the following as a clue:

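from pprint import pprint

from sklearn.preprocessing import LabelEncoder

labels = [
    "Gusto",
    "Kestrel 200 SCI Older Road Bike",
    "Vilano Aluminum Road Bike 21 Speed Shimano",
    "Fixie",
]
le = LabelEncoder()
# LabelEncoder numbers the classes in sorted (alphabetical) order
y = le.fit_transform(labels)
# Map each integer code to its label; tolist() gives plain Python ints as keys
encoding_scheme = dict(zip(y.tolist(), labels))
pprint(encoding_scheme)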

{0: 'Fixie',
 1: 'Gusto',
 2: 'Kestrel 200 SCI Older Road Bike',
 3: 'Vilano Aluminum Road Bike 21 Speed Shimano'}
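If the classifier was fitted directly on the string labels, the same mapping can also be read off the fitted model. scikit-learn stores the sorted class labels in `classes_`, and the columns of `predict_proba` follow that order; since the per-class SHAP outputs explain those probability columns, they are expected to follow the same order. The sketch below is only an illustration: `X_demo`, `y_demo`, and `demo_clf` are made-up names and the features are random noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

labels = [
    "Gusto",
    "Kestrel 200 SCI Older Road Bike",
    "Vilano Aluminum Road Bike 21 Speed Shimano",
    "Fixie",
]

# Hypothetical toy data: 40 rows of random features, 10 rows per label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = np.array(labels * 10)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)
demo_clf.fit(X_demo, y_demo)

# classes_ holds the sorted unique labels; position i is the class index i
class_index = {i: str(name) for i, name in enumerate(demo_clf.classes_)}
print(class_index)

# predict_proba has one column per entry of classes_, in the same order
proba_shape = demo_clf.predict_proba(X_demo).shape
print(proba_shape)
```

Reading the mapping from `classes_` avoids having to re-create the encoder that was used at training time.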

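So, for example, shap_values[3] in this particular case is for 'Vilano Aluminum Road Bike 21 Speed Shimano'.

To further understand how to interpret SHAP values, let's prepare a synthetic dataset for multiclass classification with 100 features and 10 classes: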

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap import summary_plot

X, y = make_classification(1000, 100, n_informative=8, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)

(750, 100)

At this point we have a training dataset with 750 rows, 100 features, and 10 classes.
Let's train a RandomForestClassifier and feed it to TreeExplainer:

import numpy as np

clf = RandomForestClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
explainer = TreeExplainer(clf)
# Stack the per-class SHAP arrays into a single 3-D array
shap_values = np.array(explainer.shap_values(X_train))
print(shap_values.shape)

(10, 750, 100)

10 : number of classes. All SHAP values are organized into 10 arrays, 1 array per class.
750 : number of datapoints. We have local SHAP values per datapoint.
100 : number of features. We have SHAP value per every feature.
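The class-first layout above matches the list-of-arrays output shown here. As far as I know, some newer SHAP releases instead return a single array shaped (n_samples, n_features, n_classes), so it may be worth normalizing the layout before indexing by class. The helper below is a hypothetical sketch (`to_class_first` is not part of SHAP) that uses only NumPy and assumes the number of samples and features is known.

```python
import numpy as np


def to_class_first(shap_values, n_samples, n_features):
    """Hypothetical helper: return SHAP values as (n_classes, n_samples, n_features)."""
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class
        return np.stack(shap_values, axis=0)
    arr = np.asarray(shap_values)
    if arr.ndim != 3:
        raise ValueError(f"expected a 3-D array, got shape {arr.shape}")
    if arr.shape[:2] == (n_samples, n_features):
        # Class-last layout: move the class axis to the front
        return np.moveaxis(arr, -1, 0)
    return arr  # assumed to be class-first already


# Dummy arrays standing in for real SHAP output (5 samples, 3 features, 4 classes)
as_list = [np.zeros((5, 3)) for _ in range(4)]
as_class_last = np.zeros((5, 3, 4))
print(to_class_first(as_list, 5, 3).shape)
print(to_class_first(as_class_last, 5, 3).shape)
```

The shape test is ambiguous when the number of classes happens to equal the number of samples and features, so treat it as a convenience rather than a guarantee.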

For example, for class 3 you will have:

print(shap_values[3].shape)

(750, 100)

750: SHAP values for every datapoint
100: SHAP value contributions for every feature

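Finally, you can run a sanity check to make sure the model's actual predictions match those reconstructed from the SHAP values.
To do so, we (1) swap the first two dimensions of shap_values, (2) sum the SHAP values over all features for each class, and (3) add the base values to those sums: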

shap_values_ = shap_values.transpose((1,0,2))

np.allclose(
    clf.predict_proba(X_train),
    shap_values_.sum(2) + explainer.expected_value
)

True
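Written out, the check above verifies the additivity property of SHAP values for every sample and class. The notation is my own, not the answer's: $p_c(x_i)$ is the predicted probability of class $c$ for sample $i$ (a column of `clf.predict_proba`), $E_c$ is the base value for class $c$ (an entry of `explainer.expected_value`), and $\phi_{c,i,j}$ is the SHAP value `shap_values[c][i][j]`.

```latex
p_c(x_i) \;=\; E_c + \sum_{j=1}^{100} \phi_{c,i,j},
\qquad c = 0, \dots, 9, \quad i = 1, \dots, 750
```

So each SHAP value can be read as the amount by which feature $j$ moves the predicted probability of class $c$ away from the base value for that sample.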

Then you can proceed to summary_plot, which shows feature rankings based on SHAP values on a per-class basis. For class 3 this will be:

summary_plot(shap_values[3],X_train)

The resulting plot is interpreted as follows:

For class 3 most influential features based on SHAP contributions are 16,59,24

For feature 15 lower values tend to result in higher SHAP values (hence higher probability of the class label)

Features 50, 45, 48 are least influential out of 20 displayed
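A ranking like the one in the plot can also be obtained numerically. A common convention, and to my understanding the ordering summary_plot uses, is to rank features by the magnitude of their SHAP values across all datapoints. The sketch below assumes that convention; `top_features` and the tiny `demo` array are hypothetical stand-ins for a real `shap_values[3]`.

```python
import numpy as np


def top_features(class_shap, k=3):
    """Hypothetical helper: rank features by mean absolute SHAP value.

    class_shap has shape (n_samples, n_features), e.g. the values for one class.
    Returns the k most influential (feature_index, importance) pairs.
    """
    importance = np.abs(class_shap).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]
    return [(int(j), float(importance[j])) for j in order]


# Made-up SHAP values for 2 datapoints and 3 features
demo = np.array([
    [0.1, -0.2, 0.9],
    [-0.1, 0.3, -0.7],
])
ranking = top_features(demo, k=2)
print([feature for feature, _ in ranking])
```

With real data this would be called as `top_features(shap_values[3], k=20)` to list the 20 features displayed for class 3.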

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/65549588/
