python - Unsure about PCA in sklearn

Tags: python scikit-learn pca

I need to do some PCA with sklearn, and I want to make sure I am doing it the right way. Here is my code:

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca_result = pca.fit_transform(data)

eigenvalues = pca.singular_values_
print(eigenvalues)

x = pca_result[:,0]
y = pca_result[:,1]

The data looks like this:

[[ -6.4186, -14.3534,  18.1296,  -2.8110,  14.0298],
[ -7.1220, -17.1501,  21.2807,  -3.5025,  16.4489],
[ -8.4652, -18.9316,  25.0303,  -4.1773,  18.5066],
...,
[ -4.7054,   6.1389,   3.5146,  -0.1036,  -0.7332],
[ -5.8533,   9.9087,   4.1178,  -0.5211,  -2.2415],
[ -6.2969,  13.8951,   3.4365,  -0.9207,  -4.2024]]

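These are the eigenvalues: [1005.2761, 853.5491, 65.058365, 49.994457, 10.277865]. I am not quite sure about the last two lines. I want to plot the data projected into a 2D space, which seems to account for most of the variance in the data (basically a 2D plot of 5D data, since it looks like it lives on a 2D manifold). Am I doing this right? Thanks!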

Best answer

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

Such dimensionality reduction can be a very useful step for visualising and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible. For example, selecting L = 2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out, so if the data contains clusters these too may be most spread out, and therefore most visible to be plotted out in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable.

https://en.wikipedia.org/wiki/Principal_component_analysis
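One way to check how much variance the first two components actually retain is to look at `explained_variance_ratio_` on a fitted `PCA` object. The sketch below is only an illustration: the question's full dataset is not available, so `data` here is a hypothetical synthetic stand-in (200 points lying close to a 2D plane embedded in 5D), and the names `latent` and `mixing` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the asker's data: 200 points near a 2D plane in 5D.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * [10.0, 8.0]   # two strong directions
mixing = rng.normal(size=(2, 5))                    # embed the plane in 5D
data = latent @ mixing + rng.normal(scale=0.5, size=(200, 5))  # small noise

# Fit with all 5 components so the ratios cover the total variance.
pca = PCA(n_components=5).fit(data)

# Fraction of the total variance captured by each component (largest first).
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print(ratios)
print(cumulative)
```

If the second entry of `cumulative` is close to 1, a 2D projection loses very little of the variance, which is the situation the question describes.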

So you need to run:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)

x = pca_result[:,0]
y = pca_result[:,1]

Then you have a two-dimensional space.
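One detail in the question's code is worth a note: `pca.singular_values_` holds singular values, not eigenvalues. In scikit-learn the eigenvalues of the sample covariance matrix are exposed as `pca.explained_variance_`, and they equal the squared singular values divided by `n_samples - 1`. Because variance scales with the square of the singular values, the five numbers printed in the question would put roughly 99.6% of the variance in the first two components, if they are indeed singular values from a 5-component fit. The sketch below checks the relationship on hypothetical random data (the original dataset is elided in the question).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated 5D data; not the asker's dataset.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=5)
pca_result = pca.fit_transform(data)

n_samples = data.shape[0]

# Eigenvalues recovered from the singular values of the centered data.
eig_from_singular = pca.singular_values_ ** 2 / (n_samples - 1)

# Eigenvalues of the sample covariance matrix, sorted largest first.
eig_from_cov = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

# Both should agree with what scikit-learn reports as explained_variance_.
matches_singular = np.allclose(pca.explained_variance_, eig_from_singular)
matches_cov = np.allclose(pca.explained_variance_, eig_from_cov)
print(matches_singular, matches_cov)

# The first two columns of the projection are the 2D coordinates to plot.
x = pca_result[:, 0]
y = pca_result[:, 1]
```

The last two lines are the same as in the question: taking columns 0 and 1 of the transformed data gives the coordinates along the first and second principal components, regardless of whether `n_components` is 2 or 5.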

Regarding "python - Unsure about PCA in sklearn", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59242252/
