python - Scikit K-means clustering performance measure

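I am trying to do clustering with the K-means method, and I would like to measure how well the clustering performs.

I am not an expert, but I am eager to learn more about clustering.

Here is my code: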

import pandas as pd
from sklearn import datasets

#loading the dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data)

#K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)

#We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Species']

#we merge this dataframe with df
prediction = pd.concat([df,pred], axis = 1)

#We store the clusters (feature columns only, so that the label column
#does not distort the distances used by the Dunn index)
clus0 = df.loc[prediction.Species == 0]
clus1 = df.loc[prediction.Species == 1]
clus2 = df.loc[prediction.Species == 2]
k_list = [clus0.values, clus1.values, clus2.values]

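Now that I have stored my KMeans model and my three clusters, I am trying to measure the quality of the clustering with the Dunn Index (a larger index is better). For that I import the jqm_cvi package (available here):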

from jqmcvi import base
base.dunn(k_list)

My question is: does Scikit Learn already provide any internal clustering evaluation (other than silhouette_score)? Or does another well-known library?

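Best answer

Besides the Silhouette Score, the Elbow Criterion can also be used to evaluate K-Means clustering. It is not available as a function/method in Scikit-Learn. To evaluate K-Means clustering with the Elbow Criterion, we need to compute the SSE.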

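The idea of the Elbow Criterion is to choose the k (number of clusters) at which the SSE drops abruptly. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.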
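To make the definition concrete, here is a minimal sketch that computes the SSE by hand and compares it with the inertia_ attribute of a fitted KMeans model. The values k=3, n_init=10 and random_state=0, and the use of all four Iris features, are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# k=3, n_init and random_state are illustrative choices, not from the answer
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each sample, then sum the squared
# Euclidean distances between every sample and its own centroid
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
sse_manual = float(np.sum((X - assigned_centers) ** 2))

print(sse_manual, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why inertia_ can stand in for the SSE.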

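Compute the sum of squared errors (SSE) for each value of k, where k is the number of clusters, and plot the result as a line chart. As k increases, the SSE tends to decrease toward 0 (SSE = 0 when k equals the number of data points in the dataset, because each data point is then its own cluster and there is no error between it and the center of its cluster).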
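The limiting case can be checked on a tiny example. The sketch below uses six made-up, distinct 2-D points (not part of the original answer) and fits K-Means with k=1, k=2 and k equal to the number of points; n_init and random_state are again illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six distinct 2-D points (made-up toy data)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])

inertias = {}
for k in (1, 2, len(points)):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 6))
```

With one cluster per point, every centroid coincides with its only member, so the SSE is zero up to floating-point error.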

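So the goal is to choose a small value of k that still has a low SSE. The elbow usually marks the point where increasing k starts to give diminishing returns.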
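One way to see the diminishing returns numerically is to print how much the SSE shrinks each time k grows by one. This is only a sketch: it assumes the full four-feature Iris data, and n_init=10 and random_state=0 are illustrative settings rather than anything prescribed by the answer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# SSE for k = 1..9 (n_init and random_state are illustrative choices)
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 10)}

# Relative SSE reduction gained by going from k-1 to k clusters
for k in range(2, 10):
    drop = (sse[k - 1] - sse[k]) / sse[k - 1]
    print(f"k={k}: SSE={sse[k]:.2f}, reduction vs k-1: {drop:.1%}")
```

The reductions are large for the first few values of k and then flatten out, which is the same pattern the line chart shows visually.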

Example on the Iris dataset:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
# Use three of the four features for this example
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    # The labels are not written back into `data`: doing so would turn them
    # into an extra feature for the next iteration's fit
    sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

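If the line chart looks like an arm, then the "elbow" on the arm (the bend circled in red in the plot) gives the optimal value of k (number of clusters). Based on the elbow in this line chart, the optimal number of clusters is 3.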

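Note: the Elbow Criterion is heuristic in nature and may not work for your dataset. Follow your intuition based on the dataset and the problem you are trying to solve.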
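Because the elbow is only a heuristic, it can be cross-checked against the silhouette score mentioned at the start of this answer. The sketch below does that for a few values of k. As an addition that is not part of the original answer, it also calls calinski_harabasz_score and davies_bouldin_score, two other internal indices in sklearn.metrics. The range of k, n_init, random_state and the use of all four Iris features are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X = load_iris().data

scores = {}
for k in range(2, 7):  # these indices are only defined for at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }
    print(k, scores[k])
```

None of these indices needs ground-truth labels, and they do not always agree with each other or with the elbow, so they are best read together rather than as a single verdict.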

Hope this helps!

Regarding python - Scikit K-means clustering performance measure, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43784903/
