r - K-均值算法:Lloyd、Forgy、MacQueen、Hartigan-Wong

我正在使用 R 中的 K-Means 算法，我想找出 4 种算法 Lloyd、Forgy、MacQueen 和 Hartigan-Wong 的区别，这些算法可用于 stats 包中的函数“kmeans”。

但是值得注意的是，我对这个问题得到了充分的回答。

我只找到了一些很少见的信息: (访问http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means)

从这个描述来看，Lloyd、Forgy 和 Hartigan-Wong 在我看来是一样的。最小化内平方和和最小化欧氏距离是一样的。

MacQueen 的不同之处在于，如果我是对的，如果一个对象被移动到另一个集群，它会更新两个相关的集群。

尽管如此，我仍然看不出这些算法在哪些方面不同。

最佳答案

R 提供 Lloyd 算法作为 kmeans() 的选项；默认算法，通过 Hartigan 和 Wong(1979)要聪明得多。像 MacQueen 的算法 (MacQueen, 1967)，每当移动一个点时，它都会更新质心；它还可以做出聪明(节省时间)的选择在检查最近的集群。另一方面，Lloyd 的 k-means 算法是所有这些聚类算法中第一个也是最简单的。

Lloyd 算法(Lloyd，1957 年)采用一组观察或案例(思考:行 nxp 矩阵，或实数中的点)并将它们聚类到 k 组中。它试图最小化簇内平方和 enter image description here

其中 u_i 是集群 S_i 中所有点的平均值。该算法按如下方式进行(我将免去详尽符号的形式): enter image description here

然而，R 的实现存在问题，问题出现在考虑多个起点。我应该注意，考虑通常是谨慎的几个不同的起点，因为算法保证收敛，但不是保证覆盖到全局最优。对于大型、高维的情况尤其如此问题。我将从一个简单的示例开始(大，不是特别困难)。

(这里我会贴一些图片，因为我们不能用latex写数学公式)

enter image description here

请注意，该解决方案与之前实现的解决方案非常相似，尽管集群是任意的。更重要的是，这项工作并行只用了 0.199 秒!一定这好得令人难以置信:使用 3 个处理器内核最多只能占用三分之一的资源我们第一次(连续)运行的时间。这是个问题吗？这听起来像是免费的午餐。没有偶尔有免费午餐的问题，是吗？

enter image description here

这并不总是适用于 R 函数，但有时我们有机会直接查看在代码处。这是那些时代之一。我要把这段代码放到文件 mykmeans.R 中，并手动编辑它，在不同的地方插入 cat() 语句。这是一个聪明的方法这个，使用 sink()(虽然这在 Sweave 中似乎不起作用，但它可以交互工作):

> sink("mykmeans.R")
> kmeans
> sink()

现在编辑文件，更改函数名称并添加 cat() 语句。笔记您还必须删除尾随行::

enter image description here

然后我们可以重复我们的探索，但是使用 mykmeans():

> source("mykmeans.R")
> start.kmeans <- proc.time()[3]
> ans.kmeans <- mykmeans(x, 4, nstart = 3, iter.max = 10, algorithm = "Lloyd")
JJJ statement 1: 0 elapsed time.
JJJ statement 5: 2.424 elapsed time.
JJJ statement 6: 2.425 elapsed time.
JJJ statement 7: 2.52 elapsed time.
JJJ statement 6: 2.52 elapsed time.
JJJ statement 7: 2.563 elapsed time.

enter image description here

现在我们开始做生意了:大部分时间都花在了语句 5 之前(我知道这是当然，这就是为什么语句 5 是 5 而不是 2)... 你可以继续玩它

代码如下:

#######################################################################
# kmeans()

N <- 100000
x <- matrix(0, N, 2)
x[seq(1,N,by=4),] <- rnorm(N/2)
x[seq(2,N,by=4),] <- rnorm(N/2, 3, 1)
x[seq(3,N,by=4),] <- rnorm(N/2, -3, 1)
x[seq(4,N,by=4),1] <- rnorm(N/4, 2, 1)
x[seq(4,N,by=4),2] <- rnorm(N/4, -2.5, 1)
start.kmeans <- proc.time()[3]
ans.kmeans <- kmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

these <- sample(1:nrow(x), 10000)
plot(x[these,1], x[these,2], pch=".")
points(ans.kmeans$centers, pch=19, cex=2, col=1:4)

library(foreach)
library(doMC)
registerDoMC(3)
start.kmeans <- proc.time()[3]
ans.kmeans.par <- foreach(i=1:3) %dopar% {
  return(kmeans(x, 4, nstart=1, iter.max=10, algorithm="Lloyd"))
}
TSS <- sapply(ans.kmeans.par, function(a) return(sum(a$withinss)))
ans.kmeans.par <- ans.kmeans.par[[which.min(TSS)]]
ans.kmeans.par$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

sink("mykmeans.Rfake")
kmeans
sink()

source("mykmeans.R")
start.kmeans <- proc.time()[3]
ans.kmeans <- mykmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

#######################################################################
# Diving

x <- read.csv("Diving2000.csv", header=TRUE, as.is=TRUE)
library(YaleToolkit)
whatis(x)

x[1:14,c(3,6:9)]

meancol <- function(scores) {
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}
x$panelmean <- meancol(x$JScore)

x[1:14,c(3,6:9,11)]

meancol <- function(scores) {
  browser()
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}

x$panelmean <- meancol(x$JScore)

这里是描述:

Number of cases: 10,787 scores from 1,541 dives (7 judges score each
dive) performed in four events at the 2000 Olympic Games in Sydney,
Australia.

Number of variables: 10.

Description: A full description and analysis is available in an
article in The American Statistician (publication details to be
announced).

Variables:

Event       Four events, men's and women's 3M and 10m.
Round       Preliminary, semifinal, and final rounds.
Diver       The name of the diver.
Country     The country of the diver.
Rank        The final rank of the diver in the event.
DiveNo      The number of the dive in sequence within round.
Difficulty  The degree of difficulty of the dive.
JScore      The score provided for the judge on this dive.
Judge       The name of the judge.
JCountry    The country of the judge.

和用于试验的数据集 https://www.dropbox.com/s/urgzagv0a22114n/Diving2000.csv

关于r - K-均值算法:Lloyd、Forgy、MacQueen、Hartigan-Wong，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/20446053/

r - K-均值算法:Lloyd、Forgy、MacQueen、Hartigan-Wong

上一篇：performance - 降低该算法的时间复杂度

下一篇：algorithm - 如何在集群中运行的节点中选举主节点？