python - 有向图的最大强连通分量

我正在研究 Networkx .MultiDiGraph()从总共 82927 个定向电子邮件数据构建的对象。在当前阶段，我正在尝试从 .MultiDiGraph() 中获取最大的强连接组件。对象及其对应的子图。可以访问文本数据here . 这是我的工作代码:

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = '->')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))

# G is a .MultiDiGraph object
# using .strongly_connected_components() to get the part of G that has the most nodes
# using list comprehension
number_of_nodes = [len(n) for n in sorted(nx.strongly_connected_components(G))]
number_of_nodes

# 'number_of_nodes' return a list of [1, 1, 1,...,1] of length 167 (which is the exact number of nodes in the network)

# using the recommended method in networkx documentation

largest = max(nx.strongly_connected_components(G), key=len)
largest

# 'largest' returns {92}, not sure what this means...

正如我在上面的代码块中指出的那样，列表理解方法返回一个长度为 167(这是我的数据中的节点总数)的 [1, 1, 1,..., 1] 的列表，而max(nx.strongly_connected_components(G), key=len)返回 {92} ，我不确定这是什么意思。

看来我的代码有问题，我可能错过了处理数据的几个关键步骤。有谁愿意看一看并启发我吗？

谢谢。

注意:修改后的代码(感谢 Eric 和 Joel)

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = '	')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
# per @Joel's comment, adding 'create_using = nx.DiGraph()'     
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight', create_using = nx.DiGraph())
# adding this 'directed' edge list to .MultiDiGraph() object

G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))

我们现在检查该网络中最大的强连通组件(根据节点数量)。

In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: len(largest)
Out [2]: 126

最大的强连通组件由 126 个节点组成。

[更新] 经过进一步的尝试和错误，我发现需要使用 create_using = .MultiDiGraph() (而不是 .DiGraph() )将数据加载到 networkx 上时，否则，即使您为 MultiDiGraph 获得了正确数量的节点及其弱/强连接子图，您可能仍然会弄错边数!这会反射(reflect)在你身上.strongly_connected_subgraphs()输出。

对于我这里的情况，我会推荐其他人使用这个单行

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
G = nx.read_edgelist(path="email_network.txt", data=[('time', int)], create_using=nx.MultiDiGraph(), nodetype=str)

我们可以实现.strongly_connected_components(G)和 strongly_connected_subgraphs进行验证。

如果您使用 networkx输出 G从第一个代码块开始，max(nx.strongly_connected_components(G), key=len)将给出一个包含 126 个节点和 52xx 个边缘的输出，但是如果你应用我上面列出的单行代码，你将得到:

In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: G_sc = max(nx.strongly_connected_subgraphs(G), key=len)

In [3]: nx.number_of_nodes(G_sc) 
Out [3]: 126
In [4]: nx.number_of_nodes(G_sc) 
Out [4]: 82130

由于与不同的 networkx 相关联的计数机制不同，您将使用这两种方法获得相同数量的节点，但边缘数量不同图类。

最佳答案

错误的根本原因是 nx.from_pandas_dataframe 默认创建无向图。所以 email 是一个无向图。当您随后创建有向图时，每条边仅出现在一个方向上。

要修复它，请使用带有参数 create_using = DiGraph 的 nx.from_pandas_dataframe

与您得到的输出相关的旧评论

所有强连接组件都有一个节点。

当您执行 max(nx.strongly_connected_components(G), key=len) 时，它会找到长度最长的节点集并将其返回。在你的例子中，它们的长度都是 1，所以它返回其中一个(我相信无论哪个 networkx 碰巧首先放入 nx.strongly_connected_components(G))。但它返回的是集合，而不是长度。所以 {92} 是它返回的节点集。

碰巧 {92} 被决胜局选为 nx.strongly_connected_components(G) 中“最长”长度为 1 的组件。

例子:

max([{1}, {3}, {5}], key = len)
> {1}

关于python - 有向图的最大强连通分量，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/46316874/

python - 有向图的最大强连通分量

上一篇：python - Pandas :每月日期范围与天数

下一篇：python - Django - {% static %} 和/static/有什么不同？