python - 如何在 Python 的嵌套列表中获取具有最多共同元素的两个列表

我有一个列表，如下所示

[["This", "is","a", "test"], ["test","something", "here"], ["cat", "dog", "fish"]]

我怎样才能得到具有最多共同词的两个列表？在这种情况下，它将是第一个和第二个列表，因为它们都有共同的词测试

我尝试通过查找两个列表的每个组合的交集并跟踪具有最多共同单词的组合来解决这个问题。然而，对于 100,000 个列表，此方法似乎效率低下。我认为这将是(100,000 个选择 2)组合。有更快的方法吗？

这是我的代码尝试

from itertools import combinations

a = [["This", "is","a", "test"], ["test","something", "here"], ["cat", "dog", "fish"]]
c= combinations(a,2)


max = 0
for pair in c:
    intersection = set(pair[0]).intersection(pair[1]) 
    if len(intersection) > max:
        max = len(intersection)
        mostsimilar = pair

print mostsimilar

我的程序的输出是我所期望的，但是在更大的测试用例上它非常慢

输出:

(['This', 'is', 'a', 'test'], ['test', 'something', 'here'])

最佳答案

根据我对问题的理解，我认为这应该可行:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KDTree, DistanceMetric

data = np.array([' '.join(words) for words in [['This', 'is', 'a', 'test'], ['test', 'something', 'here'], ['cat', 'dog', 'fish']]])

vectorised = CountVectorizer(binary=True).fit_transform(data).todense()
metric = DistanceMetric.get_metric('manhattan')
kdtree = KDTree(vectorised, metric=metric)
distances, indices = [result[:, 1] for result in kdtree.query(vectorised, k=2, dualtree=True)]

nearest_distances = (distances == distances.min())
print(data[nearest_distances])

输出:

['This is a test' 'test something here']

我按以下方式重铸问题:

单词(或句子)的每个 list 都可以表示为稀疏矩阵中的一行，其中特定列中的 1 表示单词存在，0 表示单词不存在，使用 sklearn 的 CountVectorizer。

然后，我们可以看出两个句子的相似度，作为稀疏矩阵中的行，可以通过比较它们在每一列中的元素的值来确定，这就是曼哈顿距离。这意味着我们遇到了最近邻问题。

sklearn 还提供了一个 k 维树类，我们可以使用它来为数据集中的每个点找到最近的两个邻居(因为一个点的最近邻居是它本身)。然后，剩下的就是找到距离最短的邻居，我们可以用它来索引原始数组。

使用 %%timeit，我在 this page 上的文本上测试了我的解决方案与 blhsing 解决方案的运行时间，将导入留在计时循环之外:

# my solution
198 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  
# blhsing's solution
4.76 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

将句子的长度限制在 20 个单词以内:

# my solution
3.2 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# blhsing's solution
6.08 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

关于python - 如何在 Python 的嵌套列表中获取具有最多共同元素的两个列表，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/55911018/

python - 如何在 Python 的嵌套列表中获取具有最多共同元素的两个列表

上一篇：python - Google App Engine 应用程序升级到 Python 3.x 的途径是什么？

下一篇：python - 将分配分成两行仍然有效吗？