python - How to create a friendship matrix for 1.5 million users in Python?

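My task is to create a friendship matrix (a user-user matrix) whose value is 1 if two users are friends and 0 if they are not. My .csv file has 1.5 million rows, so I created the following small csv to test my algorithm: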

user_id              friends
   Elena          Peter, John
   Peter          Elena, John
   John           Elena, Peter, Chris
   Chris          John

For this small csv, my code works fine:

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np

from scipy import sparse

sns.set(style="darkgrid")

user_filepath = 'H:\\YelpData\\test.csv' # this is my little test file

df = pd.read_csv(user_filepath, usecols=['user_id','friends'])

def Convert_String_To_List(string):
    if string!="None":
        li = list(string.split(", ")) 
    else:
        li = []
    return li 

friend_map = {}

for i in range(len(df)): #storing friendships in map
    friend_map[df['user_id'][i]] = Convert_String_To_List(df['friends'][i])

users = sorted(friend_map.keys()) 
user_indices = dict(zip(users, range(len(users)))) #giving indices for users

#and now the sparsity matrix:

row_ind = [] #row indices, where the value is 1
col_ind = [] #col indices, where the value is 1
data = []    # value 1

for user in users:
    for friend in friend_map[user]: # friend_map is the dict built above
        row_ind.append(user_indices[user])
        col_ind.append(user_indices[friend])

data = [1] * len(row_ind) # one entry of value 1 per friendship

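# pass the shape explicitly so the matrix stays square even if the last users have no friends
mat_coo = sparse.coo_matrix((data, (row_ind, col_ind)), shape=(len(users), len(users)))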

friend_matrix = mat_coo.toarray() #this friendship matrix is good for the little csv file

But when I try this code on the large csv (1.5 million rows), I get a memory error while storing the friendships in the map (in the for loop).

Is there any solution?

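Best answer

I think you are approaching this the wrong way. With data this large, you should think in terms of pandas and vectorized operations wherever possible.

Here is a complete pandas approach; how well it fits depends on your data.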

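import pandas as pd

# df1 is assumed to be the question's DataFrame indexed by user_id,
# e.g. df1 = df.set_index('user_id'); groupby('user_id') relies on that index name
_series = df1.friends.apply(lambda x: pd.Series(x.split(', '))).unstack().dropna()
data = pd.Series(_series.values, index=_series.index.droplevel(0))
pd.get_dummies(data).groupby('user_id').sum()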

Output

         Chris  Elena  John  Peter
user_id
Chris        0      0     1      0
Elena        0      0     1      1
John         1      1     0      1
Peter        0      1     1      0
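
A note on scale: a dense user-by-user table for 1.5 million users would hold about 2.25 × 10^12 cells, so a table like the one above probably cannot be materialized for the full file. The following is a minimal sketch, not the answer's own method, that keeps the result sparse by combining pandas' `str.split` and `explode` with `scipy.sparse.coo_matrix`. It assumes the column names from the question, assumes the string "None" marks a user without friends, and uses a toy DataFrame in place of the real csv.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy data mirroring the question's small test file (stand-in for the real csv).
df = pd.DataFrame({
    "user_id": ["Elena", "Peter", "John", "Chris"],
    "friends": ["Peter, John", "Elena, John", "Elena, Peter, Chris", "John"],
})

# One row per (user, friend) pair; rows whose friends field is "None" are skipped.
pairs = (
    df[df["friends"] != "None"]
    .assign(friend=lambda d: d["friends"].str.split(", "))
    .explode("friend")[["user_id", "friend"]]
)

# Sorted, unique user names define the row/column order of the matrix.
users = pd.Index(sorted(df["user_id"].unique()))

# Drop friends that never appear as a user_id, then map names to integer positions.
pairs = pairs[pairs["friend"].isin(users)]
rows = users.get_indexer(pairs["user_id"])
cols = users.get_indexer(pairs["friend"])

# Build the matrix in COO form and convert to CSR for row slicing and arithmetic.
n = len(users)
friend_matrix = sparse.coo_matrix(
    (np.ones(len(rows), dtype=np.int8), (rows, cols)), shape=(n, n)
).tocsr()
```

On the toy data, `friend_matrix.toarray()` gives the same 4 × 4 table as the output above, but the sparse form stores only the nonzero entries, so calling `toarray()` on the full data set should be avoided.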

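Incidentally, this can be optimized further: by using pandas you avoid memory-expensive for loops, and you can read the data in pieces with chunksize to reduce memory use even more.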
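
One way the chunksize suggestion might look in practice is sketched below. It is an illustration under assumptions, not code from the answer: `build_friend_matrix` is a hypothetical helper name, the column names are taken from the question, and the demo writes the toy data to a temporary file in place of the real csv. The first pass reads only `user_id` to give every user an integer position; the second pass streams the file with `pd.read_csv(..., chunksize=...)` and collects the index pairs for a sparse matrix.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse


def build_friend_matrix(csv_path, chunksize=100_000):
    """Two-pass, chunked construction of a sparse user-by-user matrix."""
    # Pass 1: read only the user_id column to assign each user an integer position.
    ids = pd.read_csv(csv_path, usecols=["user_id"], dtype={"user_id": str})
    users = pd.Index(ids["user_id"].unique())
    n = len(users)

    # Start with empty arrays so concatenation works even if no edges are found.
    row_parts = [np.empty(0, dtype=np.intp)]
    col_parts = [np.empty(0, dtype=np.intp)]

    # Pass 2: stream the file chunk by chunk and collect (row, col) positions.
    reader = pd.read_csv(
        csv_path,
        usecols=["user_id", "friends"],
        dtype={"user_id": str, "friends": str},
        chunksize=chunksize,
    )
    for chunk in reader:
        # "None" may arrive as the literal string or as NaN, depending on the pandas version.
        chunk = chunk[chunk["friends"].notna() & (chunk["friends"] != "None")]
        pairs = chunk.assign(friend=chunk["friends"].str.split(", ")).explode("friend")
        rows = users.get_indexer(pairs["user_id"])
        cols = users.get_indexer(pairs["friend"])
        keep = cols >= 0  # get_indexer returns -1 for friends that never appear as a user_id
        row_parts.append(rows[keep])
        col_parts.append(cols[keep])

    rows = np.concatenate(row_parts)
    cols = np.concatenate(col_parts)
    matrix = sparse.coo_matrix(
        (np.ones(len(rows), dtype=np.int8), (rows, cols)), shape=(n, n)
    ).tocsr()
    return users, matrix


# Demo on the question's toy data, written to a temporary csv file.
demo = pd.DataFrame({
    "user_id": ["Elena", "Peter", "John", "Chris"],
    "friends": ["Peter, John", "Elena, John", "Elena, Peter, Chris", "John"],
})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    demo.to_csv(path, index=False)
    users, friend_matrix = build_friend_matrix(path, chunksize=2)
```

Here the rows and columns follow the order in which users first appear in the file rather than alphabetical order. Only the integer index arrays and the sparse result are held in memory across chunks; the per-user Python lists that caused the memory error in the question are never built.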

Source: a similar question on Stack Overflow, "python - How to create a friendship matrix for 1.5 million users in Python?": https://stackoverflow.com/questions/52945678/
