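I have searched here and on Google, but to no avail. When clustering in Weka there is a handy option, classes to clusters, which matches the clusters produced by the algorithm (e.g. simple k-means) to the "ground truth" class labels you supply as the class attribute. That way we can see the clustering accuracy (% incorrect).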
Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], into the same indexing as the supplied ground truth label vector, e.g. [2, 2, 2, 3, 1, 3]?
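I think it is probably based on cluster centres and label centres, but I'm not sure how to implement it!
Any help would be greatly appreciated.
Vincent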
Best answer
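I stumbled on a similar problem a few months ago while doing clustering. I did not search very long for built-in solutions (although I'm sure they must exist) and ended up writing my own little script to best match the labels I found with the ground truth. The code is very crude, but it should get you started.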
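It is based on trying all possible rearrangements of the labels to see which one best fits the truth vector. That is, given a clustering result yte = [3 3 2 1] and ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] against y to find the best match.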
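This relies on the built-in function perms(), which becomes impractical for more than about 10 unique clusters. The code also tends to be slow for 7-10 unique clusters, since the complexity grows factorially.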
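To make the enumeration concrete, here is a small sketch of how the candidate relabelings for the yte and y from the example above could be generated with perms(). The variable names candidates, matches, and best_labels are illustrative and not part of the script below.

```matlab
yte = [3 3 2 1];                  % clustering result from the example above
y   = [1 1 2 3];                  % ground truth from the example above

cluster_names = unique(yte);      % [1 2 3]
P = perms(unique(y));             % 6-by-3: each row is one way to rename the clusters

candidates = zeros(size(P,1), numel(yte));
for i = 1:size(P,1)
    for cl = 1:numel(cluster_names)
        % rename cluster cluster_names(cl) to P(i,cl)
        candidates(i, yte == cluster_names(cl)) = P(i,cl);
    end
end

% count agreements with the ground truth for every candidate relabeling
matches = sum(candidates == repmat(y, size(P,1), 1), 2);
[best, idx] = max(matches);
best_labels = candidates(idx, :);  % the relabeling that agrees with y most often
```

For k distinct labels, P has k! rows: 5040 for k = 7 and 3,628,800 for k = 10, which is where the slowdown comes from.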
function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found
%# labels with true labels. Assumes yte and y are both vectors of length N
%# with clustering labels. Does not support fuzzy clustering.
%#
%# Algorithm is based on trying out all reorderings of cluster labels,
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which fits
%# the truth vector best. Since this approach makes use of perms(),
%# it is impractical for more than about 10 unique labels, and it slows
%# down significantly for more than 7 clusters.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    - Overall accuracy for entire clustering (OA). For
%#                 overall error, use OE = 1 - OA.
%#   true_labels - Vector giving the label rearrangement which best
%#                 matches the truth vector (y).
%#   CM          - Confusion matrix. If unique(y) has 4 elements, produce
%#                 a 4x4 matrix of the number of different errors and
%#                 correct clusterings done.

N = length(y);
y   = y(:)';                      % work with row vectors throughout
yte = yte(:)';
cluster_names = unique(yte);
class_names   = unique(y);
accuracy = 0;
maxInd = 1;                       % index of the best permutation found

perm = perms(class_names);
[pN, pM] = size(perm);

true_labels = y;

for i = 1:pN
    flipped_labels = zeros(1,N);
    % clusters beyond the number of classes keep label 0 (unmatched)
    for cl = 1:min(pM, numel(cluster_names))
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end

    testAcc = sum(flipped_labels == y)/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end
end

% rows: true class, columns: matched cluster label
CM = zeros(pM,pM);
for rc = 1:pM
    for cc = 1:pM
        CM(rc,cc) = sum( (y==class_names(rc)) .* (true_labels==class_names(cc)) );
    end
end
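As a side note, the confusion matrix can also be built in a single call with accumarray. This is only a sketch, not part of the original script: it assumes the labels are the integers 1..K, and it uses the ground truth and relabeled vectors from the example call in this answer. The names K and per_class are illustrative.

```matlab
y           = [1 2 2 3 3 3];     % ground truth (values from the example call)
true_labels = [1 2 2 3 2 1];     % relabeled clustering result
K = max([y true_labels]);        % assumption: labels are integers 1..K

% Rows index the true class, columns the matched cluster label.
CM = accumarray([y(:) true_labels(:)], 1, [K K]);

accuracy  = trace(CM) / sum(CM(:));   % overall accuracy: correct / total
per_class = diag(CM) ./ sum(CM, 2);   % fraction of each true class recovered
```

Reading the rows of CM this way shows which true classes the clustering splits across several clusters.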
Example call:
[acc, newLabels, CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')

acc =
    0.6667

newLabels =
     1     2     2     3     2     1

CM =
     1     0     0
     0     2     0
     1     1     1
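Because the permutation search grows factorially, one possible alternative for larger numbers of clusters is to treat the matching as an assignment problem on the cluster-by-class count table. The sketch below is not the method used in calculateAccuracy. It assumes MATLAB R2019a or later, where matchpairs is available, and the names counts, cost, and mapped are illustrative.

```matlab
yte = [3 2 2 1 2 3]';            % clustering result (same data as the example call)
y   = [1 2 2 3 3 3]';            % ground truth

[clusters, ~, ci] = unique(yte); % ci: cluster index 1..nC for every sample
[classes,  ~, yi] = unique(y);   % yi: class index 1..nK for every sample

% counts(i,j) = number of samples in cluster i whose true class is j
counts = accumarray([ci yi], 1, [numel(clusters) numel(classes)]);

% matchpairs minimises total cost, so convert "maximise overlap" into a cost.
cost = max(counts(:)) - counts;
% A large unmatched cost makes leaving a pair unmatched unattractive.
M = matchpairs(cost, max(cost(:)) + 1);

mapped = zeros(size(yte));       % clusters left unmatched keep label 0
for r = 1:size(M,1)
    % M(r,1) is a cluster index, M(r,2) the class index assigned to it
    mapped(ci == M(r,1)) = classes(M(r,2));
end
accuracy = mean(mapped == y);
```

On this data the assignment should agree with the brute-force result above; unlike perms(), the count table only grows with the number of clusters times the number of classes.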
Regarding matlab - how to match cluster labels with my 'ground truth' labels in Matlab, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11683785/