I get a memory error when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5.
For the related function, silhouette_score, this post suggests using the sample_size parameter to reduce the sample size before calling silhouette_samples. I am not sure that downsampling would still produce reliable results, so I hesitate to do that.
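For reference, the workaround from that post would look roughly like this (silhouette_score, unlike silhouette_samples, accepts a sample_size argument; df_scaled and df['Cluster_Label'] are my data and labels, shown in the traceback below):

from sklearn.metrics import silhouette_score

score = silhouette_score(df_scaled, df['Cluster_Label'],
                         metric='euclidean',
                         sample_size=10000,  # score a random subsample only
                         random_state=0)     # make the subsample reproducible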
My input X is a [107545 rows x 12 columns] dataframe. I only have 8 GB of RAM, but I would not really consider it large.
sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
167 check_number_of_labels(len(le.classes_), X.shape[0])
168
--> 169 distances = pairwise_distances(X, metric=metric, **kwds)
170 unique_labels = le.classes_
171 n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245 func = partial(distance.cdist, metric=metric, **kwds)
1246
-> 1247 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248
1249
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088 if n_jobs == 1:
1089 # Special case to avoid picklability checks in delayed
-> 1090 return func(X, Y, **kwds)
1091
1092 # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244 YY = row_norms(Y, squared=True)[np.newaxis, :]
245
--> 246 distances = safe_sparse_dot(X, Y.T, dense_output=True)
247 distances *= -2
248 distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
The calculation seems to rely on euclidean_distances, which crashed on the call to np.dot. I am not dealing with sparsity here, so maybe there is no solution. When computing distances I normally use numpy.linalg.norm(A-B). Does that have better memory handling?
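Roughly what I mean (a small sketch with random stand-in data of the same shape as mine):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(107545, 12)                  # stand-in for my scaled dataframe

# Distances from one sample to all the others, computed one row at a time
# instead of materializing the full 107545 x 107545 matrix.
dists_to_first = np.linalg.norm(X - X[0], axis=1)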
Best answer
Update: PR 11135 should fix this issue within scikit-learn, making the rest of this post obsolete.
You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them, that is (1e5)**2 = 1e10 distances. Each one is a floating-point number; the float64 format takes 8 bytes of memory, so the distance matrix has a size of 8e10 bytes, which is 74.5 GB.
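Spelled out as a quick check (using the rounded 1e5 figure; the actual 107545 samples would need even a bit more):

n = 10**5                        # ~1e5 samples
n_distances = n ** 2             # one entry per pairwise distance: 1e10 entries
bytes_needed = n_distances * 8   # float64 takes 8 bytes per entry
print(bytes_needed / 2**30)      # ≈ 74.5 GiB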
This gets reported on GitHub occasionally: #4701, #4197, and the answer there is roughly: it is a NumPy problem, np.dot cannot handle matrices of that size. Although there is one comment saying:
it might be possible to break this up into sub-matrices to do the calculation more memory efficient.
Indeed, if instead of forming one giant distance matrix at the beginning, the method computed the relevant blocks of it inside the loop over labels, much less memory would be required.
It is not hard to modify the method, using its source, so that it masks first instead of computing the distances first and applying a binary mask afterwards. This is what I do below. Instead of N**2 memory, where N is the number of samples, it requires n**2, where n is the maximal cluster size. If this looks practical, I imagine it could be added to scikit-learn behind some flag... Note, though, that this version does not support metric='precomputed'.
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find inter_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :], metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
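A hypothetical call on the data from the question (names taken from the traceback above):

sil = silhouette_samples_memory_saving(df_scaled.values, df['Cluster_Label'].values,
                                       metric='euclidean')
print(sil.mean())   # mean over all samples, comparable to silhouette_score on the full data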
Original question about this Python MemoryError from sklearn.metrics.silhouette_samples on Stack Overflow: https://stackoverflow.com/questions/47702750/