python - Creating a NetworkX graph from a similarity matrix

Tags: python pandas dataframe networkx cosine-similarity

I'm new to the world of graphs and would appreciate some help :-)

I have a dataframe containing 10 sentences, and I have calculated the cosine similarity between every pair of sentences.

Original dataframe:

    text
0   i like working with text    
1   my favourite colour is blue and i like beans
2   i have a cat and a dog that are both chubby Pets
3   reading is also working with text just in anot...
4   cooking is great and i love making beans with ...
5   my cat likes cheese and my dog likes beans
6   in some way text is a bit boring
7   cooking is stressful when it is too complicated
8   pets can be so cute but they are often a lot o...
9   working with pets would be a dream job

Calculating the cosine similarity:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

k = test_df['text'].tolist()

# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(k) 

# Calculate the pairwise cosine similarities 
S = cosine_similarity(X)

# Put the similarity matrix into a new dataframe
print(len(S))
df = pd.DataFrame(S)

Output of the cosine similarity:

    0   1   2   3   4   5   6   7   8   9
0   1.000000    0.204491    0.000000    0.378416    0.110185    0.000000    0.158842    0.000000    0.000000    0.282177
1   0.204491    1.000000    0.072468    0.055438    0.333815    0.327299    0.064935    0.112483    0.000000    0.000000
2   0.000000    0.072468    1.000000    0.000000    0.064540    0.231068    0.000000    0.000000    0.084140    0.000000
3   0.378416    0.055438    0.000000    1.000000    0.110590    0.000000    0.375107    0.097456    0.000000    0.156774
4   0.110185    0.333815    0.064540    0.110590    1.000000    0.205005    0.057830    0.202825    0.000000    0.071145
5   0.000000    0.327299    0.231068    0.000000    0.205005    1.000000    0.000000    0.000000    0.000000    0.000000
6   0.158842    0.064935    0.000000    0.375107    0.057830    0.000000    1.000000    0.114151    0.000000    0.000000
7   0.000000    0.112483    0.000000    0.097456    0.202825    0.000000    0.114151    1.000000    0.000000    0.000000
8   0.000000    0.000000    0.084140    0.000000    0.000000    0.000000    0.000000    0.000000    1.000000    0.185502
9   0.282177    0.000000    0.000000    0.156774    0.071145    0.000000    0.000000    0.000000    0.185502    1.000000

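I now want to create a graph from the two dataframes, where the nodes are the sentences and the edges connect them by cosine similarity. I have added the nodes as shown below, but I'm not sure how to add the edges?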

import networkx as nx

### Build graph
G = nx.Graph()

# Add node
G.add_nodes_from(test_df['text'].tolist())


# Add edges 
G.add_edges_from()
 

Best answer

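You can set both the index and the column names of df to the text column of the input dataframe (these become the nodes of the network), and then build a graph from it, treating it as an adjacency matrix, with nx.from_pandas_adjacency: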

df_adj = pd.DataFrame(df.to_numpy(), index=test_df['text'], columns=test_df['text'])
G = nx.from_pandas_adjacency(df_adj)

G.edges(data=True)
EdgeDataView([('i like working with text    ', 'i like working with text    ', {'weight': 1.0}), 
              ('i like working with text    ', 'my favourite colour is blue and i like beans', {'weight': 0.19953178577876396}),
              ('i like working with text    ', 'reading is also working with text just in anot...', {'weight': 0.39853956570404026})
              ...
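
As the output above shows, the diagonal of the similarity matrix turns into self-loops with weight 1.0, and every non-zero similarity becomes an edge. If that is not wanted, one possible follow-up is sketched below. It is a minimal sketch, not part of the original answer: the three labels, the similarity values, and the 0.3 threshold are made up for illustration.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Hypothetical 3-sentence example; labels and similarity values are made up.
labels = ["a", "b", "c"]
S = np.array([
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])
df_adj = pd.DataFrame(S, index=labels, columns=labels)

# Non-zero entries become weighted edges; zero entries are skipped.
G = nx.from_pandas_adjacency(df_adj)

# The diagonal (each sentence compared with itself) shows up as self-loops.
G.remove_edges_from(list(nx.selfloop_edges(G)))

# Optionally drop edges below a similarity threshold (assumed value).
threshold = 0.3
weak = [(u, v) for u, v, w in G.edges(data="weight") if w < threshold]
G.remove_edges_from(weak)

print(list(G.edges(data="weight")))
```

The nodes stay in the graph even when all of their edges are removed, so isolated sentences remain visible when the graph is drawn.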

Regarding python - Creating a NetworkX graph from a similarity matrix, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64027427/
