python - Separating clusters after using SLINK in Python/R

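According to the research I have read, single-linkage hierarchical clustering can produce optimal clusters. The algorithm is also known as SLINK. The libraries were originally released in C++ and are now available in Python/R.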

So far, following the steps in the documentation, I have managed to come up with:

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

## generate 100 random integers from 20 to 89 and store them in a DataFrame; the data is 1-dimensional
np.random.seed(1)
df = pd.DataFrame(np.random.randint(20,90,size=(100,1)), columns = list('A'))
df = df.sort_values(by=['A'])
df = df.values

## getting condensed distance matrix
d = pdist(df, metric='euclidean')

## running the SLINK algorithm
Z = linkage(d, 'single')

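I understand that Z is "the hierarchical clustering encoded as a linkage matrix" (as the documentation puts it), but how do I go back to the original dataset and tell apart the clusters computed from this result?

I could get clustering results with Scikit-Learn, but I believe Scikit-Learn's clustering algorithms are not optimal, which is why I turned to the SLINK algorithm. I would be grateful if anyone could help.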

Best answer

The output of scipy.cluster.hierarchy.linkage tells you how the clusters were formed at each iteration.
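To make that concrete, here is a minimal sketch of how the rows of a linkage matrix can be read. The four-point toy dataset and the names `points` and `Z_toy` are made up for illustration and are not part of the question's data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Tiny 1-D toy dataset (hypothetical values, chosen so no two merges tie).
points = np.array([[1.0], [2.0], [6.0], [13.0]])
Z_toy = linkage(points, method="single")

# Each row of the linkage matrix describes one merge:
#   columns 0 and 1: ids of the two clusters that were joined
#   column 2:        distance at which they were joined
#   column 3:        number of original points in the new cluster
# Ids 0..n-1 are the original observations; id n+i is the cluster made in row i.
n = len(points)
for i, (a, b, dist, size) in enumerate(Z_toy):
    print(f"step {i}: merge {int(a)} and {int(b)} at distance {dist:g} "
          f"-> new cluster {n + i} with {int(size)} points")
```

With n observations there are always n-1 rows, and the last row is the merge that joins everything into a single cluster.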

That information is usually not very useful on its own, so we can start by looking at the dendrogram:

import scipy.cluster.hierarchy
import matplotlib.pyplot as plt
plt.figure()
dn = scipy.cluster.hierarchy.dendrogram(Z)
plt.show()

If we want to get three clusters, we can do:

labels = scipy.cluster.hierarchy.fcluster(Z,3,'maxclust')
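Since fcluster returns one label per observation, in the same order as the rows passed to linkage, the labels can be attached straight back to the original data. Below is a rough sketch of one way to do that with pandas; the names `data`, `summary` and `cluster_1` are my own, and the data is rebuilt with the question's seed so the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the question's 1-D data (same seed and value range).
np.random.seed(1)
data = pd.DataFrame(np.random.randint(20, 90, size=(100, 1)), columns=["A"])
data = data.sort_values(by="A").reset_index(drop=True)

# Single-linkage clustering on the observations themselves.
Z = linkage(data[["A"]].to_numpy(dtype=float), method="single")

# One label per row, in the same order as the rows given to linkage.
data["cluster"] = fcluster(Z, 3, criterion="maxclust")

# Summarise each cluster: how many points it holds and the range it covers.
summary = data.groupby("cluster")["A"].agg(["count", "min", "max"])
print(summary)

# Pull out the rows of a single cluster, here the one labelled 1.
cluster_1 = data[data["cluster"] == 1]
```

Labels start at 1, and with 'maxclust' you get at most the requested number of clusters, not necessarily exactly that many.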

If you would rather cut by the distance between data points:

scipy.cluster.hierarchy.fcluster(Z,2,'distance')

This gives roughly the same result as asking for 3 clusters, because there are not many ways to cut this example dataset.
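One way to see how few cuts are available is to list the distinct merge heights and the number of clusters each one produces. This is only a sketch on data rebuilt with the question's seed; the names `X`, `heights` and `counts` are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same 1-D integer data as in the question, sorted and cast to float.
np.random.seed(1)
X = np.sort(np.random.randint(20, 90, size=(100, 1)), axis=0).astype(float)
Z = linkage(X, method="single")

# Merge heights live in the third column of Z. Integer data produces many
# tied heights, so only a handful of distinct cuts exist.
heights = np.unique(Z[:, 2])

# Cutting at each distinct height shows every cluster count that can be reached
# (apart from the trivial case of leaving every point on its own).
counts = []
for h in heights:
    labels = fcluster(Z, h, criterion="distance")
    counts.append(len(np.unique(labels)))
    print(f"cut at height {h:g}: {counts[-1]} clusters")
```

Any cluster count that does not appear in this list cannot be produced by a flat cut, which is why 'maxclust' falls back to the nearest smaller count.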

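If you look at your example, the next point where you can cut is at a height of about 1.5, which gives 16 clusters. So if you try scipy.cluster.hierarchy.fcluster(Z,5,'maxclust'), you get the same result as for 3 clusters. With a more spread-out dataset it works: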

np.random.seed(111)
df = np.random.normal(0,1,(50,3))

## getting condensed distance matrix
d = pdist(df, metric='euclidean')
Z = linkage(d, 'single')
dn = scipy.cluster.hierarchy.dendrogram(Z,above_threshold_color='black',color_threshold=1.1)

Then this works:

scipy.cluster.hierarchy.fcluster(Z,5,'maxclust')
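A quick way to check the result is to count how many points end up under each label. The sketch below repeats the setup above so that it runs on its own; `X`, `ids` and `sizes` are names I picked for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Same continuous 3-D data as above.
np.random.seed(111)
X = np.random.normal(0, 1, (50, 3))
Z = linkage(pdist(X, metric="euclidean"), "single")

labels = fcluster(Z, 5, "maxclust")

# Count the members of each cluster; labels start at 1.
ids, sizes = np.unique(labels, return_counts=True)
for cluster_id, size in zip(ids, sizes):
    print(f"cluster {cluster_id}: {size} points")
```

Because continuous data rarely produces tied merge heights, the requested number of clusters should be reachable here. Keep in mind that single linkage tends to chain points together, so the cluster sizes can be very uneven.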

This post, "python - Separating clusters after using SLINK in Python/R", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/60016770/
