python: Convert a list of (string) sets to a scipy csr_matrix

Tags: python scipy sparse-matrix

Suppose I have this list of sets:

db = [{"bread", "butter", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

How can I convert it to a scipy sparse csr_matrix? The expected output is:

[[1., 1., 0., 0., 1., 0.],
 [0., 0., 0., 1., 1., 1.],
 [1., 0., 1., 1., 1., 0.],
 [0., 0., 0., 1., 1., 1.],
 [0., 0., 1., 0., 1., 1.]]

I tried hard-coding it to understand it better, but I can't figure it out. My code is:

indptr = np.array([0, 3, 6, 10, 13, 16])
data = np.array(["bread", "butter", "milk", "eggs", "milk", "yogurt",
                "bread", "cheese", "eggs", "milk","eggs", "milk", "yogurt",
                "cheese", "milk", "yogurt"])
indices = np.array([0, 1, 4, 3, 4, 5, 0, 2, 3, 4, 3, 4, 5, 2, 4, 5])
csr_matrix((data, indices, indptr), dtype=int).toarray()

I can't get it to work. Is there a better way to do this?

Best answer

Setup:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

db = [{"bread", "butter", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"}, 
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

all_products = set()
for SET in db:
    all_products |= SET
sorted_products = sorted(all_products)

Method 2 (without Pandas):

First, build the translation table (product name to column index)

d = dict()
for i, prod in enumerate(sorted_products):
    d[prod] = i
{'bread': 0, 'butter': 1, 'cheese': 2, 'eggs': 3, 'milk': 4, 'yogurt': 5}
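The same lookup table can also be built in one expression. The following is a minimal sketch, assuming the `db` list from the setup; the name `col_index` is only illustrative and does not appear in the original answer.

```python
db = [{"bread", "butter", "milk"},
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"},
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

# Union of all sets, sorted alphabetically so the column order is stable.
sorted_products = sorted(set().union(*db))

# Dict comprehension: product name -> column index (illustrative name).
col_index = {prod: i for i, prod in enumerate(sorted_products)}
print(col_index)
```

Sorting matters here because Python sets have no guaranteed iteration order, so without it the column assignment could differ between runs.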

Then, create the full (dense) matrix and fill it in

template = np.zeros((len(db), len(all_products)), dtype=int)
for j, line in enumerate(db):
    for prod in line:
        template[j, d[prod]] = 1
array([[1, 1, 0, 0, 1, 0],
       [0, 0, 0, 1, 1, 1],
       [1, 0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1],
       [0, 0, 1, 0, 1, 1]])

Finally, convert it to a sparse matrix

matrix = csr_matrix(template)
  (0, 0)    1
  (0, 1)    1
  (0, 4)    1
  (1, 3)    1
  (1, 4)    1
  (1, 5)    1
  (2, 0)    1
  (2, 2)    1
  (2, 3)    1
  (2, 4)    1
  (3, 3)    1
  (3, 4)    1
  (3, 5)    1
  (4, 2)    1
  (4, 4)    1
  (4, 5)    1

#<5x6 sparse matrix of type '<class 'numpy.longlong'>'
#   with 16 stored elements in Compressed Sparse Row format>
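For larger inputs the dense intermediate array may be worth avoiding. As a hedged sketch (not part of the original answer), the CSR arrays can be built directly, which is what the attempt in the question was aiming for. The key difference from that attempt is that `data` holds numeric ones rather than the product names, and the shape is passed explicitly. The names `col` and `sparse_direct` are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

db = [{"bread", "butter", "milk"},
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"},
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]

# Map each product to a column index (alphabetical order).
sorted_products = sorted(set().union(*db))
col = {prod: i for i, prod in enumerate(sorted_products)}

indices = []   # column index of every stored value, row by row
indptr = [0]   # indptr[k]:indptr[k+1] delimits row k inside `indices`
for row in db:
    indices.extend(sorted(col[prod] for prod in row))
    indptr.append(len(indices))

# The stored values are numeric ones, not the product strings.
data = np.ones(len(indices), dtype=int)

sparse_direct = csr_matrix(
    (data, np.array(indices), np.array(indptr)),
    shape=(len(db), len(sorted_products)),
)
print(sparse_direct.toarray())
```

For this `db`, the generated `indptr` and `indices` are the same as the hand-written ones in the question, so only the `data` array needed to change there.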

Method 1 (Pandas):

df = pd.DataFrame(index=sorted_products, columns=range(len(db)))
print(df)

This gives you an empty DataFrame

          0       1       2       3       4
bread   NaN     NaN     NaN     NaN     NaN
butter  NaN     NaN     NaN     NaN     NaN
cheese  NaN     NaN     NaN     NaN     NaN
eggs    NaN     NaN     NaN     NaN     NaN
milk    NaN     NaN     NaN     NaN     NaN
yogurt  NaN     NaN     NaN     NaN     NaN

Then you add the sets, one column per set

for i in range(len(db)):
    df[i] = pd.Series([1]*len(db[i]), index=list(db[i]))
          0       1       2       3       4
bread   1.0     NaN     1.0     NaN     NaN
butter  1.0     NaN     NaN     NaN     NaN
cheese  NaN     NaN     1.0     NaN     1.0
eggs    NaN     1.0     1.0     1.0     NaN
milk    1.0     1.0     1.0     1.0     1.0
yogurt  NaN     1.0     NaN     1.0     1.0

Next, fill the NaN values with zeros

data = df.fillna(0).astype(int)  # cast to int so the sparse matrix stores integers

Finally, convert it to a sparse matrix

from scipy.sparse import csr_matrix
matrix = csr_matrix(data)
print(matrix)

Output (one row per product, one column per set):

#<6x5 sparse matrix of type '<class 'numpy.longlong'>'
#   with 16 stored elements in Compressed Sparse Row format>
  (0, 0)    1
  (0, 2)    1
  (1, 0)    1
  (2, 2)    1
  (2, 4)    1
  (3, 1)    1
  (3, 2)    1
  (3, 3)    1
  (4, 0)    1
  (4, 1)    1
  (4, 2)    1
  (4, 3)    1
  (4, 4)    1
  (5, 1)    1
  (5, 3)    1
  (5, 4)    1
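Note that the Pandas result is 6x5 (products as rows), while the question asks for a 5x6 matrix with one row per set. A minimal sketch of one way to get that layout is to transpose before converting. This assumes the same `db` as in the setup; `dense` and `matrix_t` are illustrative names that are not in the original answer.

```python
import pandas as pd
from scipy.sparse import csr_matrix

db = [{"bread", "butter", "milk"},
      {"eggs", "milk", "yogurt"},
      {"bread", "cheese", "eggs", "milk"},
      {"eggs", "milk", "yogurt"},
      {"cheese", "milk", "yogurt"}]
sorted_products = sorted(set().union(*db))

# Same construction as above: one row per product, one column per set.
df = pd.DataFrame(index=sorted_products, columns=range(len(db)))
for i, row in enumerate(db):
    df[i] = pd.Series([1] * len(row), index=list(row))

# Fill gaps, cast to int, then transpose so rows are sets and columns are products.
dense = df.fillna(0).astype(int).T.to_numpy()
matrix_t = csr_matrix(dense)
print(matrix_t.shape)
print(matrix_t.toarray())
```

Calling `.T` on the sparse matrix itself would also flip the layout, but that returns a CSC matrix, so transposing first keeps the result in CSR format.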

Regarding python: converting a list of (string) sets to a scipy csr_matrix, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64077947/
