python - How to do efficient matrix computation for similarity scoring without running out of memory?

I have the following code for similarity scoring:

from rapidfuzz import process, fuzz
import numpy as np
import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id+= 1
df_test = df_test.merge(groups, how="left")

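I want to identify similar names in the name column, provided those names belong to the same cluster number, and create a unique ID for them. For example, South Beach and Beach belong to cluster number 1, and their similarity score is quite high, so we associate them with a unique ID, say 1. The next cluster is number 2, and three entities from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score, so their unique ID will be, say, 2. For Cat the unique ID will be, say, 3. And so on.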

The code produces the expected result:

    name        cluster_number id
0   South Beach 1              1
1   Dog         2              2
2   Bird        3              3
3   Ant         3              4
4   Big Dog     2              2
5   Beach       1              1
6   Dear        4              5
7   Cat         2              6

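The code above is an efficient, vectorized approach to similarity scoring. It works well for small datasets, but when I try it on a dataframe with 1 million rows I get a MemoryError from rapidfuzz.process.cdist(...). As mentioned in the comment section below, the function returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float). So for 1 million names the result matrix would require about 4 TB of memory. My PC has 12 GB of free RAM, which is nowhere near enough. Any ideas on how to avoid overloading the RAM while keeping the computation vectorized?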

Following @J.M.Arnold's solution (including his comment), the code can be rewritten as:

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)
names = df_test["name"]
def calculate_similarity_matrix(names):
    scores = pd.DataFrame(process.cdist(names, names, workers=-1),  columns=names, index=names)
    return scores
chunks = np.array_split(names, 1000)
_ = []
for i, chunk in enumerate(chunks):
    matrix = calculate_similarity_matrix(chunk)
    _.append(matrix)
finished = pd.concat(_)
x, y = np.where(finished > 50)
groups = (pd.DataFrame(finished.index[x], finished.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'id'}, inplace=True)
groups.id+= 1
df_test = df_test.merge(groups, how="left")

But it does not produce the correct result:

          name  cluster_number             id
0        Beach               1              2
1  South Beach               1              8
2      Big Dog               2              3
3          Cat               2              5
4          Dog               2              7
5          Ant               3              1
6         Bird               3              4
7         Dear               4              6

Note that, for example, Dog and Big Dog have different ids, but they should have the same one.

Best answer

As maxbachmann said in your GitHub issue, it is all about the default dtype:

By default this dtype is float or int32_t depending on the scorer (for the default scorer you are using it is float)

If you take a look at the documentation of rapidfuzz.process.cdist, you can see that the dtype is specified as follows:

similarity:
- np.float32, np.float64
- np.uint8 -> stores fixed point representation of the result scaled to a range 0-100

distance:
- np.int8, np.int16, np.int32, np.int64

If not given, then the type will be np.float32 for similarities and np.int32 for distances.

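You can calculate the size of the matrix as len(queries) x len(choices) x size(dtype). For 1 million names with an 8-byte float (np.float64) that is 1 million x 1 million x 8 bytes, i.e. 8 TB (about 7.3 TiB). Even with a 4-byte dtype (the np.float32 default quoted above, or int32 as Max Bachmann mentioned) you still end up needing 4 TB (about 3.6 TiB).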
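The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula above; the helper name matrix_bytes is made up for illustration and is not part of rapidfuzz.

```python
def matrix_bytes(n_queries: int, n_choices: int, itemsize: int) -> int:
    """Size in bytes of a dense n_queries x n_choices score matrix."""
    return n_queries * n_choices * itemsize


n = 1_000_000  # one million names compared against themselves
for label, itemsize in [("float64", 8), ("float32 / int32", 4), ("uint8 / int8", 1)]:
    size = matrix_bytes(n, n, itemsize)
    # Report both decimal terabytes and binary tebibytes
    print(f"{label:>15}: {size / 1e12:.1f} TB ({size / 2**40:.2f} TiB)")
```

The same helper gives the per-block cost of any chunked variant, e.g. matrix_bytes(1000, 1000, 8) for a 1,000 x 1,000 block.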

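One option to reduce the problem is to shrink the dtype, for example to a 1-byte type. For similarity scorers the documentation quoted above lists np.uint8 (np.int8 is listed for distances), which stores the score as a fixed-point value in the range 0-100, so you lose the fractional part of each score. With the formula above the matrix shrinks to 1 TB (about 931 GiB), which is still far more than 12 GB of RAM.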

Another approach (and probably the only viable one in the long run) is to split the data and process it in smaller blocks, as Max Bachmann suggested.

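1. Define a function that calculates the matrix of similarity scores (similar to your code).
2. Split the list of names into smaller chunks.
3. Iterate over the chunks and store the similarity matrix of each step.
4. Concatenate the results into one big matrix.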

import numpy as np
import pandas as pd
from rapidfuzz import process

# Step 1
def calculate_similarity_matrix(names):
    # Do your part, e.g. processing and so forth. In the end, return the similarity matrix for "names"
    scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
    return scores

# Step 2
# Split the names into 1,000 chunks (1,000 names each for 1 million names)
chunks = np.array_split(names, 1000)

# Step 3
# Iterate over the chunks and store each matrix on disk
for i, chunk in enumerate(chunks):
    matrix = calculate_similarity_matrix(chunk)
    matrix.to_pickle(f"matrix_{i}.pkl")

# Step 4
# Read the matrices back
matrices = [pd.read_pickle(f"matrix_{i}.pkl") for i in range(len(chunks))]
# Concatenate
finished = pd.concat(matrices)

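Afterwards, finished holds the concatenated per-chunk similarity matrices. Keep in mind that each block only scores the names inside one chunk against each other, so pairs that fall into different chunks are never compared.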

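This approach lets you handle a much larger dataset without running out of memory (as your question asks), because the matrices are stored on disk between iterations.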

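However, my approach will definitely be slower than processing all the data at once (which is impossible unless you have several terabytes of RAM), because you need to write to and read from disk 1,000 times.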

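Obviously, you can adjust the number of blocks. In my current approach you have 1,000 blocks with 1,000 names each. With the formula above, each step needs only 1,000 x 1,000 x 8 bytes = 8 MB of RAM (for an 8-byte float). Experiment and adjust to whatever suits your hardware best!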

Regarding "python - How to do efficient matrix computation for similarity scoring without running out of memory?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/74780473/
