R: K-means clustering vs. community detection algorithms (weighted correlation network) - am I overcomplicating this problem?

Tags: r graph cluster-analysis nodes edges

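My data looks like this: https://imgur.com/a/1hOsFpF
The first dataset is in a standard tabular format: a list of people and their financial attributes.
The second dataset contains the "relationships" between these people: how much they paid each other and how much they owe each other.
I am interested in learning more about network- and graph-based clustering, but I am trying to better understand which kinds of situations call for it. I do not want to use graph clustering where it is not needed (avoiding a "square peg, round hole" situation).
Using R, I first created some fake data: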

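library(corrr)       # correlate(), shave(), stretch()
library(dplyr)       # %>% and filter()
library(igraph)      # graph construction and community detection
library(visNetwork)  # interactive network plot
library(stats)       # kmeans() (attached by default)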

# create first data set
Personal_Information <- data.frame(
  "name" = c("John", "Jack", "Jason", "Jim", "Julian", "Jack", "Jake", "Joseph"),
  "age" = c("41", "33", "24", "66", "21", "66", "29", "50"),
  "salary" = c("50000", "20000", "18000", "66000", "77000", "0", "55000", "40000"),
  "debt" = c("10000", "5000", "4000", "0", "20000", "5000", "0", "1000"),
  stringsAsFactors = FALSE  # keep the columns as character on R < 4.0
)

# convert the character columns to numeric
Personal_Information$age = as.numeric(Personal_Information$age)
Personal_Information$salary = as.numeric(Personal_Information$salary)
Personal_Information$debt = as.numeric(Personal_Information$debt)

# create second data set
Relationship_Information <- data.frame(
  "name_a" = c("John", "John", "John", "Jack", "Jack", "Jack", "Jason", "Jason", "Jim", "Jim", "Jim", "Julian", "Jake", "Joseph", "Joseph"),
  "name_b" = c("Jack", "Jason", "Joseph", "John", "Julian", "Jim", "Jim", "Joseph", "Jack", "Julian", "John", "Joseph", "John", "Jim", "John"),
  "how_much_they_owe_each_other" = c("10000", "20000", "60000", "10000", "40000", "8000", "0", "50000", "6000", "2000", "10000", "10000", "50000", "12000", "0"),
  "how_much_they_paid_each_other" = c("5000", "40000", "120000", "20000", "20000", "8000", "0", "20000", "12000", "0", "0", "0", "50000", "0", "0"),
  stringsAsFactors = FALSE  # keep the columns as character on R < 4.0
)

# convert the character columns to numeric
Relationship_Information$how_much_they_owe_each_other = as.numeric(Relationship_Information$how_much_they_owe_each_other)
Relationship_Information$how_much_they_paid_each_other = as.numeric(Relationship_Information$how_much_they_paid_each_other)
Then I ran a standard k-means clustering algorithm (on the first dataset) and plotted the results:
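# Method 1: simple k-means analysis with 2 clusters on the Personal_Information dataset
set.seed(123)  # k-means starts from random centers; fix the seed for reproducibility
cl <- kmeans(Personal_Information[, c("age", "salary", "debt")], centers = 2)
# plot two of the numeric columns so that the cluster centers can be overlaid
plot(Personal_Information[, c("salary", "debt")], col = cl$cluster)
points(cl$centers[, c("salary", "debt")], col = 1:2, pch = 8, cex = 2)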
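One thing worth checking with the k-means call above: salary and debt are in the tens of thousands while age is below 100, so unscaled Euclidean distances are dominated by the money columns. The following is a minimal sketch, assuming that standardizing each column is acceptable for this data. The `people` data frame only restates the numeric columns of the fake data so the block runs on its own; `nstart = 25` and the seed are arbitrary choices.

```r
# Numeric columns restated from the fake data above (hypothetical name `people`)
people <- data.frame(
  age    = c(41, 33, 24, 66, 21, 66, 29, 50),
  salary = c(50000, 20000, 18000, 66000, 77000, 0, 55000, 40000),
  debt   = c(10000, 5000, 4000, 0, 20000, 5000, 0, 1000)
)

# Standardize each column to mean 0 and standard deviation 1
people_scaled <- scale(people)

set.seed(123)  # arbitrary seed, only for reproducibility
# nstart = 25 reruns k-means from 25 random starts and keeps the best fit
cl_scaled <- kmeans(people_scaled, centers = 2, nstart = 25)

# Cluster label per person and the resulting cluster sizes
cl_scaled$cluster
table(cl_scaled$cluster)
```

Comparing `cl_scaled$cluster` with the unscaled result shows how much the grouping depends on the units of the columns.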
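K-means is how I would normally approach this problem. Now I want to see whether I can use graph clustering for this kind of problem.
First, I created a weighted correlation network from the first dataset, following http://www.sthda.com/english/articles/33-social-network-analysis/136-network-analysis-and-manipulation-using-r/ :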
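res.cor <- Personal_Information[, c(2:4)] %>%
  t() %>%                      # transpose so that each person becomes a column
  correlate() %>%              # person-by-person correlation matrix
  shave(upper = TRUE) %>%      # blank out the upper triangle, keep the lower one
  stretch(na.rm = TRUE) %>%    # long format with columns x, y, r
  filter(r >= 0.8)             # keep only strongly correlated pairs

graph <- graph_from_data_frame(res.cor, directed = FALSE)
graph <- simplify(graph)
plot(graph)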
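A note on the "weighted" part: as far as I can tell, building the graph from a data frame stores the third column as an edge attribute named `r`, while igraph's community detection functions only pick up an attribute named `weight` by default, so a clustering run on that graph treats it as unweighted. Here is a minimal sketch of one way to carry the correlations as weights, assuming base `cor()` is an acceptable stand-in for `correlate()`. The names `people`, `adj`, `g_w` and `fc_w` are hypothetical, and the data restates the numeric columns of the fake data.

```r
library(igraph)

# Numeric columns restated from the fake data above (hypothetical name `people`)
people <- data.frame(
  age    = c(41, 33, 24, 66, 21, 66, 29, 50),
  salary = c(50000, 20000, 18000, 66000, 77000, 0, 55000, 40000),
  debt   = c(10000, 5000, 4000, 0, 20000, 5000, 0, 1000)
)

# Person-by-person correlations: transpose so that each person is a column
cor_mat <- cor(t(as.matrix(people)))

# Keep only correlations of at least 0.8; a zero entry means "no edge"
adj <- cor_mat
adj[adj < 0.8] <- 0
diag(adj) <- 0

# weighted = TRUE stores the non-zero entries in the `weight` edge attribute
g_w <- graph_from_adjacency_matrix(adj, mode = "upper", weighted = TRUE, diag = FALSE)

# cluster_fast_greedy() uses E(g_w)$weight when that attribute exists
fc_w <- cluster_fast_greedy(g_w)
membership(fc_w)
```

Unlike the edge-list route, this keeps people with no strong correlation as isolated vertices instead of dropping them from the graph.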
Then I ran a graph clustering algorithm:
# run graph clustering (also called community detection) on the correlation network
fc <- cluster_fast_greedy(graph)
V(graph)$community <- fc$membership
nodes <- data.frame(id = V(graph)$name, title = V(graph)$name, group = V(graph)$community)
nodes <- nodes[order(nodes$id, decreasing = FALSE), ]
# igraph:: prefix avoids a clash with other packages that export as_data_frame()
edges <- igraph::as_data_frame(graph, what = "edges")[1:2]

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
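The graph clustering seems to work, but I am not sure whether it is the best way to approach this problem.
Can anyone offer some advice? Am I overcomplicating this?
Thanks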

Best answer

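You might be interested in reading up on "fusion-based approaches for community detection" (https://link.springer.com/chapter/10.1007/978-3-030-44584-3_24). These fusion-based approaches are apparently designed specifically to take node attributes into account.
This might also help: https://www.nature.com/articles/srep30750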
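As a simpler baseline than the fusion methods, note that the relationship table from the question is already an edge list, so community detection can be run on it directly. This is a minimal sketch, assuming the amount paid is a sensible edge weight and that the direction of payment can be ignored. Both are assumptions, as is the choice of `cluster_walktrap()`. The `rel` data frame restates the question's fake data under hypothetical column names.

```r
library(igraph)

# Payments restated from the question's second dataset (hypothetical name `rel`)
rel <- data.frame(
  name_a = c("John", "John", "John", "Jack", "Jack", "Jack", "Jason", "Jason",
             "Jim", "Jim", "Jim", "Julian", "Jake", "Joseph", "Joseph"),
  name_b = c("Jack", "Jason", "Joseph", "John", "Julian", "Jim", "Jim", "Joseph",
             "Jack", "Julian", "John", "Joseph", "John", "Jim", "John"),
  paid   = c(5000, 40000, 120000, 20000, 20000, 8000, 0, 20000,
             12000, 0, 0, 0, 50000, 0, 0)
)

# Keep only pairs with an actual payment and expose the amount as `weight`
rel_paid <- rel[rel$paid > 0, ]
names(rel_paid)[names(rel_paid) == "paid"] <- "weight"

# Undirected graph; A->B and B->A payments are merged by summing their weights
g_rel <- graph_from_data_frame(rel_paid, directed = FALSE)
g_rel <- simplify(g_rel, edge.attr.comb = list(weight = "sum"))

# Walktrap community detection, which uses the `weight` attribute when present
wt <- cluster_walktrap(g_rel)
membership(wt)
```

If the communities found this way differ markedly from the k-means groups, that would suggest the relationship data carries information the attributes do not, which is the situation where graph clustering is worth the extra complexity.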

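Regarding "R: K-means clustering vs. community detection algorithms (weighted correlation network) - am I overcomplicating this problem?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64849921/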
