r - What is the best way to select the most dissimilar individuals from a population?

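I am trying to use k-means clustering to select the most diverse markers in my population. For example, to select 100 lines, I cluster the whole population into 100 clusters and then pick, from each cluster, the marker closest to the centroid.

The problem with my solution is that it takes too much time (maybe my function needs optimizing), especially when the number of markers exceeds 100000.

So I would really appreciate it if anyone could show me a new way to select markers that maximizes diversity in my population, and/or help me optimize my function so that it runs faster.

Thank you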

# example:

library(BLR)
data(wheat)
dim(X)
mdf <- mostdiff(t(X), 100, 1, nstart=1000)

Here is the mostdiff function I use:

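mostdiff <- function(markers, nClust, nMrkPerClust, nstart=1000) {
    # rows are the items being clustered (markers), columns are the features
    transposedMarkers <- as.matrix(markers)
    mrkClust <- kmeans(transposedMarkers, nClust, nstart=nstart)
    save(mrkClust, file="markerCluster.Rdata")

    # within clusters, pick the markers that are closest to the cluster centroid
    # turn the vector of which markers belong to which clusters into a list nClust long
    # each element of the list is a vector of the markers in that cluster

    clustersToList <- function(nClust, clusters) {
        vecOfCluster <- function(whichClust, clusters) {
            return(which(whichClust == clusters))
        }
        # lapply always returns a list, even when all clusters have the same size
        return(lapply(seq_len(nClust), vecOfCluster, clusters))
    }

    pickCloseToCenter <- function(vecOfCluster, whichClust, transposedMarkers, centers, pickHowMany) {
        clustSize <- length(vecOfCluster)
        # if there are fewer than three markers, the center is equally distant from all so don't bother
        if (clustSize < 3) return(vecOfCluster[1:min(pickHowMany, clustSize)])

        # figure out the distance (squared) between each marker in the cluster and the cluster center
        distToCenter <- function(marker, center){
            diff <- center - marker
            return(sum(diff*diff))
        }

        # drop=FALSE keeps the subset a matrix so apply() works row by row
        dists <- apply(transposedMarkers[vecOfCluster, , drop=FALSE], 1, distToCenter, center=centers[whichClust,])
        return(vecOfCluster[order(dists)[1:min(pickHowMany, clustSize)]])
    }

    # NOTE: the code as posted stops after the helper definitions; the call below is an
    # assumed completion that applies them to every cluster and returns the row indices
    clusterList <- clustersToList(nClust, mrkClust$cluster)
    picked <- mapply(pickCloseToCenter, clusterList, seq_len(nClust),
                     MoreArgs=list(transposedMarkers=transposedMarkers,
                                   centers=mrkClust$centers, pickHowMany=nMrkPerClust),
                     SIMPLIFY=FALSE)
    return(unlist(picked))
}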

Best answer

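If kmeans is the most expensive part, you can apply the k-means algorithm to a random subset of your population. As long as the random subset is still large compared to the number of centroids you are picking, you will get essentially the same results. Alternatively, you can run several kmeans on several subsets and merge the results.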
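A minimal sketch of the subsetting idea, under some assumptions: the helper name `subsampleDiverse`, its `subsetSize` argument, and the toy data are made up for illustration and are not part of the question's code. It fits kmeans on a random sample of rows only, then picks the row of the full matrix that is closest to each centroid. Picking the closest row within the sample instead would be an equally reasonable reading of the suggestion.

```r
# Hypothetical helper (not from the original post): cluster a random subset,
# then pick, for each centroid, the nearest row of the full matrix.
subsampleDiverse <- function(markers, nClust, subsetSize, nstart = 10) {
    markers <- as.matrix(markers)
    subsetSize <- min(subsetSize, nrow(markers))
    idx <- sample(nrow(markers), subsetSize)
    fit <- kmeans(markers[idx, , drop = FALSE], centers = nClust, nstart = nstart)

    # squared Euclidean distance from every row to every centroid,
    # using |x - c|^2 = |x|^2 + |c|^2 - 2 x.c
    d2 <- outer(rowSums(markers^2), rowSums(fit$centers^2), "+") -
          2 * tcrossprod(markers, fit$centers)

    # for each centroid (column), the index of the closest row in the full data
    unname(apply(d2, 2, which.min))
}

set.seed(42)
# toy stand-in for the marker matrix: 500 markers (rows) x 20 lines (columns)
toy <- matrix(rnorm(500 * 20), nrow = 500)
picked <- subsampleDiverse(toy, nClust = 10, subsetSize = 200)
```

Two centroids could in principle share the same nearest row, so the result may contain duplicates; wrapping it in `unique()` is one way to handle that. A much smaller `nstart` than the 1000 used in the question also cuts the runtime roughly in proportion.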

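Another option is to try a k-medoids algorithm, which gives you centroids that are actual members of the population, so the second step of finding the member of each cluster closest to its centroid is no longer needed. It might be slower than k-means, though.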
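A sketch of the k-medoids route, assuming the `cluster` package (one of R's recommended packages) is installed. The toy matrix and the parameter values are illustrative only. `pam()` returns the row indices of the medoids in `id.med`, and `clara()`, which runs the same procedure on several random samples and so tends to scale better to many rows, returns them in `i.med`.

```r
library(cluster)  # recommended package, assumed to be installed

set.seed(1)
# toy stand-in for the marker matrix: 300 markers (rows) x 20 lines (columns)
toy <- matrix(rnorm(300 * 20), nrow = 300)

# pam(): medoids are actual rows of the data; id.med holds their row indices
fit <- pam(toy, k = 10)
picked <- fit$id.med

# clara(): pam on repeated random samples, keeping the best set of medoids
fitBig <- clara(toy, k = 10, samples = 20, sampsize = 100)
pickedBig <- fitBig$i.med
```

Either index vector can be used directly to subset the data, for example `toy[picked, ]`, with no separate "closest to centroid" step.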

Regarding "r - What is the best way to select the most dissimilar individuals from a population?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19472959/
