python - Clustering a categorical data set with a distance-based approach

Tags: python machine-learning cluster-analysis k-means

I want to compare the ROCK clustering algorithm with a distance-based algorithm. Suppose we have (m) training samples and (n) features.

ROCK:

As far as I understand, ROCK does the following:

1. It calculates a similarity matrix (m*m) using Jaccard coefficients.
2. Then a threshold value is provided by the user.
3. Based on the threshold value, it then links the data points; data points that have more neighbors in common are said to be in the same cluster.
   For example, see the image below.

[Image: similarity matrix for the points A, B, C and D]

   The above picture shows the similarity matrix; let threshold_value = 0.2.
4. The algorithm then computes links between the points, which works out as below:
for A - A    (only A exceeds the threshold value)
    B - BCD  (because bb, bc and bd exceed the threshold value)
    C - BCD
    D - BCD
Now, since B, C and D each have 3 neighbors in common, they are grouped into the same cluster.

Therefore we get two clusters: {A} and {B, C, D}.
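The steps above can be sketched in a few lines of NumPy. The toy samples below are hypothetical (they are not the values behind the image), but with threshold 0.2 they reproduce the same link structure: A alone, and B, C, D each sharing 3 common neighbours.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard coefficient of two sets of categorical attribute values."""
    return len(a & b) / len(a | b)

# Hypothetical toy data set: each sample is a set of categorical values.
samples = {
    "A": {1, 2},
    "B": {3, 4, 5},
    "C": {4, 5, 6},
    "D": {3, 5, 6},
}

names = list(samples)
threshold = 0.2

# Step 1: the m*m similarity matrix of Jaccard coefficients.
sim = np.array([[jaccard_similarity(samples[i], samples[j]) for j in names]
                for i in names])

# Steps 2-4: points i, j are "neighbours" when sim[i, j] >= threshold;
# links(i, j) = number of neighbours they have in common.
neighbours = sim >= threshold
links = neighbours.astype(int) @ neighbours.astype(int)
```

Here `links[B][C]` comes out as 3, matching the worked example, so {B, C, D} merge into one cluster while A stays alone.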

Distance-based approach:

1. I take a different approach, but like ROCK, I too create the similarity matrix.
2. I also compute the initial links:
   for A - A    (only A exceeds the threshold value)
       B - BCD  (because bb, bc and bd exceed the threshold value)
       C - BCD
       D - BCD
3. Now, instead of finding neighbors, I perform some mojo and find the best centroids.
4. After finding the centroids, I run the k-means clustering algorithm over the similarity matrix (m*m).
5. Since I find the centroids beforehand, the algorithm takes less time, because k-means is not run multiple times for randomly chosen centroids.
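A minimal sketch of steps 4–5, under the assumption that the "mojo" has already picked certain rows of the similarity matrix as initial centroids (here rows A and B of the worked example, chosen arbitrarily): plain Lloyd iterations with no random restarts.

```python
import numpy as np

def kmeans(X, centroids, n_iter=10):
    """Lloyd's iterations starting from the given centroids
    (no random restarts, mirroring step 5 above)."""
    for _ in range(n_iter):
        # assign each row to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

# The 4x4 similarity matrix from the worked example (rows A, B, C, D).
sim = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.5, 0.5],
    [0.0, 0.5, 1.0, 0.5],
    [0.0, 0.5, 0.5, 1.0],
])

# Assume the "mojo" picked rows A and B as the initial centroids.
labels, _ = kmeans(sim, sim[[0, 1]])
```

On this input the labels split the data into {A} and {B, C, D}, the same partition ROCK produces. scikit-learn's `KMeans` supports the same idea directly via `init=<array>` with `n_init=1`.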

Problem statement:

The problem I see is space complexity: the similarity matrix is an (m*m) matrix, and if m is very large, say 1 million, storing such a huge matrix is difficult; and because of the matrix size, the Euclidean distance computations also take time.

However, I believe that in ROCK there is absolutely no need to store the matrix, because the links can be built on the fly while the Jaccard coefficients between data points are being computed.
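A sketch of that idea, with hypothetical sample data: stream over the pairs and keep only the (typically sparse) neighbour sets, never materialising the full m*m similarity matrix.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbour_lists(samples, threshold):
    """Build ROCK-style neighbour sets on the fly: each Jaccard
    coefficient is used once and then discarded."""
    names = list(samples)
    nbrs = {x: {x} for x in names}   # every point neighbours itself (sim = 1)
    for a, b in combinations(names, 2):
        if jaccard(samples[a], samples[b]) >= threshold:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs

samples = {"A": {1, 2}, "B": {3, 4, 5}, "C": {4, 5, 6}, "D": {3, 5, 6}}
nbrs = neighbour_lists(samples, threshold=0.2)

# links(i, j) = |nbrs[i] & nbrs[j]| can now be computed per pair on demand.
```

This still costs O(m*m) time for the pairwise pass, but the memory held at any point is only the neighbour sets, not the dense matrix.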

I ran the distance-based approach on the similarity matrix of the mushroom data set (uci.org), and the output was very similar to ROCK's, and even better for some other data sets.

Questions:

1. Is my understanding of ROCK correct?
2. Is it even worth considering creating such a large similarity matrix, storing it on disk, and using it later to calculate distances?
3. I would really appreciate it if someone could provide the big-O complexity of the distance-based approach.

Thanks :)

Best Answer

To my knowledge, clustering becomes very memory-intensive as the size increases, and you will have to find a way to reduce the dimensionality of your data.

I am not familiar with ROCK, but I have worked on clustering problems before, where I had to cluster millions of documents.

Distance metric : Levenshtein distance
Clustering algorithm : DBSCAN
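For illustration, a minimal Levenshtein implementation over a hypothetical word list; the resulting pairwise matrix is the kind of input scikit-learn's `DBSCAN` accepts with `metric="precomputed"`.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance, O(len(s)*len(t))."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

words = ["kitten", "sitting", "sittin", "flour"]   # toy example

# Precomputed pairwise distance matrix, e.g. for
# sklearn.cluster.DBSCAN(eps=2, min_samples=2, metric="precomputed").
dist = [[levenshtein(a, b) for b in words] for a in words]
```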

Coming back to your questions:

Question 2 :
Is it even worth considering creating such a large similarity matrix, storing it on disk, and using it later to calculate distances?

I would never suggest building such a large matrix. For example, a distance matrix for 1 million words would require about 4 TB of space. You will have to use some kind of blocking technique to group somewhat-similar documents, and then apply a clustering algorithm on top.
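A toy illustration of blocking, with a deliberately crude key (real systems use sorted-neighbourhood keys or MinHash/LSH bands instead): the expensive pairwise comparisons then run only inside each small block, not across all m*m pairs.

```python
from collections import defaultdict

def block_key(word):
    # Crude, illustrative blocking key: first letter plus a length bucket.
    return (word[0], len(word) // 3)

words = ["cluster", "clusters", "clustering", "distance", "distances", "rock"]

blocks = defaultdict(list)
for w in words:
    blocks[block_key(w)].append(w)

# Pairwise distance computation / clustering is now applied per block.
```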

Question 3 :
I would really appreciate it if someone could provide the big-O complexity of the distance-based approach.

In general, the cost of computing the distance between two words is negligible, since words are not very long. Your complexity is the number of comparisons, so if you have n words the time complexity is O(n*n).

Regarding "python - Clustering a categorical data set with a distance-based approach", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30211577/
