statistics - Clustering data after dimensionality reduction with PCA

Tags: statistics machine-learning cluster-analysis pca

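Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.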

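Using PCA on the Iris dataset (with the data in the CSV ordered so that all of the first class are listed, then the second, then the third) yields the following plot:

[Figure: Ordered data run through PCA]

It can be seen that the three classes in the Iris dataset have been preserved. However, when the order of the samples is randomized, the following plot is produced:

[Figure: Unordered data run through PCA]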

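Above, it is not clear how many clusters/classes are contained in the data set. In this case (the more real-world case), how would one identify the number of classes, and would a clustering algorithm such as K-Means be effective?

Would errors arise from discarding the lower-order principal components?

Edit: To be clear, I am asking whether a dataset can be clustered after running PCA and, if so, what the most accurate way of doing so would be.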

Best answer

Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.

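Your data might separate well in a low-variance dimension, and PCA keeps the high-variance directions while discarding exactly those. I would not recommend running PCA prior to clustering.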
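To make the low-variance point concrete, here is a minimal sketch on synthetic data (not the Iris data from the question). The two features, their scales, and the `separation` helper are all assumptions made up for illustration: one feature has a large spread but carries no group information, the other has a small spread but splits the two groups cleanly. PCA is computed with a plain SVD so that only NumPy is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)  # two hypothetical groups of n points each

# Feature 0: large variance, identical distribution in both groups.
spread = rng.normal(0.0, 10.0, size=2 * n)
# Feature 1: small variance, but the groups sit near -1 and +1.
signal = np.where(labels == 0, -1.0, 1.0) + rng.normal(0.0, 0.1, size=2 * n)
X = np.column_stack([spread, signal])

# PCA via SVD of the centered data; rows of `components` are the principal axes.
Xc = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ components.T
explained = singular_values**2 / np.sum(singular_values**2)


def separation(values):
    """Gap between the group means, in units of the pooled standard deviation."""
    a, b = values[labels == 0], values[labels == 1]
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std


sep_pc1 = separation(scores[:, 0])  # the component PCA would keep
sep_pc2 = separation(scores[:, 1])  # the component PCA would discard

print("explained variance ratio:", explained)
print("group separation along PC1:", sep_pc1)
print("group separation along PC2:", sep_pc2)
```

In this construction the first component explains almost all of the variance yet barely distinguishes the groups, while the discarded second component separates them clearly. Real data will rarely be this extreme, so treat it as an illustration of the risk rather than a prediction.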

Above, it is not clear how many clusters/classes are contained in the data set. In this case (the more real-world case), how would one identify the number of classes, and would a clustering algorithm such as K-Means be effective?

There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
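As a rough sketch of what that looks like in practice, the snippet below runs scikit-learn's `DBSCAN` and `MeanShift` on three synthetic blobs without telling either algorithm how many clusters to find. The blob centers and the `eps`, `min_samples`, and `bandwidth` values are assumptions chosen to suit this toy data. Neither algorithm is parameter-free: you trade the cluster count for a density or bandwidth setting that needs tuning on your own data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs (illustrative data only).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0.0, 0.3, size=(100, 2)) for c in centers])

# DBSCAN groups points by density; the label -1 marks noise, not a cluster.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_dbscan = len(set(db_labels) - {-1})

# Mean Shift moves points toward density modes; each mode becomes a cluster.
ms_labels = MeanShift(bandwidth=1.5).fit_predict(X)
n_meanshift = len(set(ms_labels))

print("clusters found by DBSCAN:", n_dbscan)
print("clusters found by Mean Shift:", n_meanshift)
```

If no sensible bandwidth is known in advance, `sklearn.cluster.estimate_bandwidth` can provide a starting value for Mean Shift. For DBSCAN, `eps` is usually chosen by inspecting nearest-neighbor distances.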

Source: a similar question on Stack Overflow, "statistics - Clustering data after dimensionality reduction with PCA": https://stackoverflow.com/questions/19002810/
