MATLAB 有一个很好的 silhouette function帮助评估 k-means 的聚类数。是否也有 Python 的 Numpy/Scipy 的等效项？

最佳答案

我在下面展示了一个示例 silhouette在 MATLAB 和 Python/Numpy 中实现(请记住，我更擅长 MATLAB):

1) 软件

function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end

模拟 silhouette 中的情节在 MATLAB 中的函数，我们将剪影值按簇分组，在每个簇内排序，然后水平绘制条形图。 MATLAB 添加了 NaN 来将条形图与不同的簇分开，我发现简单地对条形图进行颜色编码会更容易:

%# sample data
load fisheriris
X = meas;
N = size(X,1);

%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);

%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d');           %#'

h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')

%# compare against SILHOUETTE
figure, silhouette(X,IDX)

mySilhouette

2) python

这是我用 Python 得出的结论:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """

    N = X.shape[0]              # number of instances
    K = len(np.unique(cIDX))    # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind]) 
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)

    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0, 
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()

python_mySilhouette

更新:

正如其他人所指出的，scikit-learn 从那时起就添加了自己的 silhouette metric implementation .要在上面的代码中使用它，请将对自定义 silhouette 函数的调用替换为:

from sklearn.metrics import silhouette_samples

...

#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX)    # <-- scikit-learn function

...

其余代码仍可按原样使用以生成完全相同的图。

关于python - 等价于 Matlab 的聚类质量函数？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/6644445/

python - 等价于 Matlab 的聚类质量函数？

1) 软件

2) python

更新:

上一篇：python - 从 Python 调用 PowerShell 脚本

下一篇：python - 我可以将 float128 设置为 numpy 中的标准 float 组吗