I have a large corpus of documents that I want to merge if they have a significant n-gram overlap (bigrams, in my case). Consider the list of sets:
corpus = [{'example', 'bigram'}, {'another', 'example'}, {'some', 'outlier'}]
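(Each set represents one bigram as an unordered pair of adjacent words. For illustration, here is a minimal sketch of how such sets could be built, assuming plain whitespace tokenization; the bigram_sets helper is hypothetical:)

def bigram_sets(text):
    # hypothetical helper: each adjacent word pair becomes an unordered frozenset
    words = text.split()
    return [frozenset(pair) for pair in zip(words, words[1:])]

# e.g. bigram_sets('another example bigram') yields two sets:
# {'another', 'example'} and {'example', 'bigram'}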
I have the following code to compute the similarity matrix:
import numpy as np

sim_matrix = np.zeros((len(corpus), len(corpus)))
# compare every unique pair (i, j) with i < j
for i in range(len(corpus)):
    for j in range(i+1, len(corpus)):
        sim_matrix[i][j] = get_ngram_overlap(corpus[i], corpus[j])
# each set fully overlaps with itself
np.fill_diagonal(sim_matrix, 1)
where the get_ngram_overlap function is defined as follows:
def get_ngram_overlap(ngrams_s1, ngrams_s2):
    # overlap coefficient: |s1 & s2| / min(|s1|, |s2|)
    if min(len(ngrams_s1), len(ngrams_s2)) == 0:
        return 0
    common_ngrams = ngrams_s1 & ngrams_s2
    return len(common_ngrams) / min(len(ngrams_s1), len(ngrams_s2))
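For example, the first two sets above share one of their two words, so their overlap is 0.5:

print(get_ngram_overlap({'example', 'bigram'}, {'another', 'example'}))
>> 0.5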
The result is an MxM matrix, where M is the number of n-gram sets in my corpus:
print(sim_matrix)
>> [[1.  0.5 0. ]
    [0.  1.  0. ]
    [0.  0.  1. ]]
The problem is that as M grows the code becomes very slow, and the computation gets expensive because all unique pairs have to be compared. Is there a way to compute this more efficiently, e.g. using an optimized distance-matrix computation with a custom similarity metric that works on sets of strings? Any help is appreciated!
Accepted answer
You can speed up the computation by first building an index and then running get_ngram_overlap() only on the combinations that share a common word:
import numpy as np
from itertools import combinations

# make the items in corpus frozensets (to make them hashable)
corpus = [frozenset({'example', 'bigram'}), frozenset({'another', 'example'}), frozenset({'some', 'outlier'})]

def get_ngram_overlap(ngrams_s1, ngrams_s2):
    mn = min(len(ngrams_s1), len(ngrams_s2))
    if mn == 0:
        return 0
    common_ngrams = ngrams_s1 & ngrams_s2
    return len(common_ngrams) / mn

# index: word -> all sets containing that word
# nums:  set  -> its indices in the corpus (appended once per word; the
#                duplicates only cause a few redundant writes below)
index, nums = {}, {}
for i, b in enumerate(corpus):
    for word in b:
        index.setdefault(word, []).append(b)
        nums.setdefault(b, []).append(i)

# compute the overlap only for pairs of sets that share at least one word
index2 = {}
for k, v in index.items():
    for c in combinations(v, 2):
        index2[c] = get_ngram_overlap(*c)

# scatter the precomputed overlaps into the upper triangle of the matrix
sim_matrix = np.zeros((len(corpus), len(corpus)))
for a, b in index2:
    for x in nums[a]:
        for y in nums[b]:
            sim_matrix[(x, y) if x < y else (y, x)] = index2[(a, b)]
np.fill_diagonal(sim_matrix, 1)

print(sim_matrix)
Prints:
[[1.  0.5 0. ]
 [0.  1.  0. ]
 [0.  0.  1. ]]
A benchmark of the original loop against the indexed version, using 10k bigrams drawn from 500 distinct words:
import random
import numpy as np
from timeit import timeit
from itertools import combinations

random.seed(123)
# 10k random bigrams (as frozensets) over a vocabulary of 500 words
corpus = [frozenset({f'word{random.randint(1, 500)}', f'word{random.randint(1, 500)}'}) for _ in range(10_000)]

def get_ngram_overlap(ngrams_s1, ngrams_s2):
    mn = min(len(ngrams_s1), len(ngrams_s2))
    if mn == 0:
        return 0
    common_ngrams = ngrams_s1 & ngrams_s2
    return len(common_ngrams) / mn

def fn1():
    # original version: compares all unique pairs
    sim_matrix = np.zeros((len(corpus), len(corpus)))
    for i in range(len(corpus)):
        for j in range(i+1, len(corpus)):
            sim_matrix[i][j] = get_ngram_overlap(corpus[i], corpus[j])
    np.fill_diagonal(sim_matrix, 1)
    return sim_matrix

def fn2():
    # indexed version: compares only pairs that share a word
    index, nums = {}, {}
    for i, b in enumerate(corpus):
        for word in b:
            index.setdefault(word, []).append(b)
            nums.setdefault(b, []).append(i)
    index2 = {}
    for k, v in index.items():
        for c in combinations(v, 2):
            index2[c] = get_ngram_overlap(*c)
    sim_matrix = np.zeros((len(corpus), len(corpus)))
    for a, b in index2:
        for x in nums[a]:
            for y in nums[b]:
                sim_matrix[(x, y) if x < y else (y, x)] = index2[(a, b)]
    np.fill_diagonal(sim_matrix, 1)
    return sim_matrix

# both versions must produce the same matrix
assert np.array_equal(fn1(), fn2())

t1 = timeit(fn1, number=1)
t2 = timeit(fn2, number=1)

print(t1)
print(t2)
On my machine (Python 3.10 / AMD Ryzen 5700x) this prints:
19.253960204077885
0.37714757514186203
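The indexed version is roughly 50x faster here. For completeness, the "optimized distance matrix computation" the question asks about can also be sketched with sparse matrix algebra: one-hot encode each set as a row over the vocabulary, and a single sparse matrix product then yields all pairwise intersection sizes at once. This is only a sketch, assuming SciPy is available and every set is non-empty; overlap_matrix is an illustrative name, not part of the answer above:

import numpy as np
from scipy import sparse

def overlap_matrix(corpus):
    # hypothetical alternative: vocabulary maps each word to a column index
    vocab = {w: j for j, w in enumerate({w for s in corpus for w in s})}
    # one-hot rows: A[i, vocab[w]] = 1 for every word w in corpus[i]
    rows, cols = zip(*[(i, vocab[w]) for i, s in enumerate(corpus) for w in s])
    A = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                          shape=(len(corpus), len(vocab)))
    inter = (A @ A.T).toarray()                   # pairwise |s_i & s_j|
    sizes = np.array([len(s) for s in corpus])
    sim = inter / np.minimum.outer(sizes, sizes)  # overlap coefficient
    # keep the upper triangle and a unit diagonal, like sim_matrix above
    return np.triu(sim, k=1) + np.eye(len(corpus))

Note that the dense MxM result still needs O(M^2) memory, just like sim_matrix above, so the index-based approach remains preferable when most pairs share no word.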
Source: "The fastest way of computing n-gram overlap matrix in Python" on Stack Overflow: https://stackoverflow.com/questions/76190115/