python - Fastest way to compare a list of lists against a list of sets

Tags: python python-3.x list performance

Is there a faster way to do the following list comprehension?

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

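Constraints

  • listOfSets: none
  • listOfLists: the order of the elements inside each sublist must be preserved, but the order of the lists themselves need not be. I.e.:
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] != [[[1, 2, 3], [6, 5, 4]], [[7, 8, 9]]]

but

[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] == [[[7, 8, 9]], [[1, 2, 3], [4, 5, 6]]]
  • ret: ret must keep the same order as the original listOfLists, as described above.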

My code produces the following list of lists. Each list contains sublists of the same size, but the number of sublists varies. I.e.:

listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

I need to filter this list of lists to remove every sublist whose elements, taken as a set, appear in a list of sets:

listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}, ...]

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

ret = [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

Note that [1, 2, 3] is missing from ret.

I tried the following variant:

ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
    ] for lst in listOfLists
]

The idea was that not(set(subList) in listOfSets) would return sooner, because it only needs to find one match, but to no avail:

%timeit ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]
772 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
   ] for lst in listOfLists
]
797 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

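Best answer

Original answer

Since the other answers are already very thorough, I won't expand much, but I got better performance by using the difference between sets.

Let: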

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
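
The definitions above use tuples, which compare by order, whereas the question tests set(subList) against a list of sets. A minimal sketch of one way to bridge the two, assuming the order-insensitive match from the question is what is wanted: convert listOfSets into a set of frozenset objects, so each membership test is a hash lookup rather than a scan of the list. The names filter_sublists and set_of_frozensets are illustrative and do not come from the original post.

```python
def filter_sublists(list_of_lists, list_of_sets):
    """Drop every sublist whose elements, as a set, appear in list_of_sets."""
    # frozenset is hashable, so the lookup table can itself be a set.
    set_of_frozensets = {frozenset(s) for s in list_of_sets}
    return [
        [sub for sub in lst if frozenset(sub) not in set_of_frozensets]
        for lst in list_of_lists
    ]


listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]
ret = filter_sublists(listOfLists, listOfSets)
print(ret)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```

The sublists stay as lists in their original order; only the lookup structure changes.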

For comparison, here is JJ Hassan's second example run on my machine (note that I added the not in from the original question):

>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(4, 6, 5)], []]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now, using the difference between sets:

>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

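This option is worth trying too, since I believe the speed difference may become more noticeable as the lists grow.

The idea behind it is that you have a superlist, which is a list of lists, each of which contains sublists (or, in this case, tuples). According to your requirements, the intermediate lists do not need to preserve order (only the superlist and the sublists do), and we want the elements that are not found in set_of_tuples. The intermediate lists can therefore be treated as sets, and taking the elements that do not belong to set_of_tuples is simply the difference between sets.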
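
As a small sketch of that reasoning, the two approaches can be compared side by side on the sample data from this answer. One thing to keep in mind, which follows from how Python sets work rather than from the original post: turning an intermediate list into a set discards both its order and any duplicate entries.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Comprehension: keeps order and duplicates inside each intermediate list.
as_lists = [
    [sl for sl in l if sl not in set_of_tuples]
    for l in list_of_lists_of_tuples
]

# Set difference: each intermediate list becomes a set, so order and
# duplicates inside it are lost, but the outer order is unchanged.
as_sets = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

# Both approaches keep the same elements at each outer position.
same_elements = all(set(a) == b for a, b in zip(as_lists, as_sets))
print(as_lists)       # [[(4, 6, 5)], []]
print(as_sets)        # [{(4, 6, 5)}, set()]
print(same_elements)  # True
```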

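Edit

I just came up with a slightly faster solution using operator and itertools. However, this new solution is only better when there is enough data.

Let's start with the previous solution: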

filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

Now, with a simple application of map, this becomes:

filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

Then we can rewrite it with operator.sub as:

from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

Or, using a plain list() call instead of a list comprehension:

from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))

Finally, we use map again, this time bringing itertools.repeat into play:

from itertools import repeat
from operator import sub

filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
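
The rewriting steps above can be checked against each other with a short sketch on the sample data. The names data, v1 to v4 and all_equal are illustrative only. Note that repeat(set_of_tuples) is an infinite iterator; map stops as soon as its shortest input, the mapped lists, is exhausted.

```python
from itertools import repeat
from operator import sub

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
data = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Step 0: plain comprehension with set difference.
v1 = [set(l) - set_of_tuples for l in data]
# Step 1: build the sets with map.
v2 = [s - set_of_tuples for s in map(set, data)]
# Step 2: replace the "-" operator with operator.sub.
v3 = [sub(s, set_of_tuples) for s in map(set, data)]
# Step 3: let map call sub, pairing every set with the same set_of_tuples.
v4 = list(map(sub, map(set, data), repeat(set_of_tuples)))

all_equal = v1 == v2 == v3 == v4
print(all_equal)  # True
print(v4)         # [{(4, 6, 5)}, set()]
```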

This new approach is actually the slowest when the lists are small:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)

But now let's define larger lists. I roughly used the sizes you mentioned in the comments:

>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

With this new data, here is what I get on my machine:

>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
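
%timeit is an IPython magic, so the measurements above cannot be pasted into a plain script. A rough sketch of the same comparison with the standard-library timeit module follows. The fixed seed, the function names, and the repeat and number values are assumptions made for illustration, and the figures it prints will differ from machine to machine.

```python
import timeit
from itertools import repeat
from operator import sub
from random import randint, seed

seed(0)  # assumption: fixed seed so runs are comparable; not in the original
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}


def comprehension():
    return [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]


def set_difference():
    return [set(l) - set_of_tuples for l in list_of_lists_of_tuples]


def map_repeat():
    return list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))


candidates = {
    "comprehension": comprehension,
    "set_difference": set_difference,
    "map_repeat": map_repeat,
}
calls = 200
timings = {}
for name, func in candidates.items():
    # Best of 5 repeats, each running `calls` calls; reported per call.
    best = min(timeit.repeat(func, number=calls, repeat=5))
    timings[name] = best / calls * 1e6
    print(f"{name}: {timings[name]:.1f} us per call")
```

Taking the minimum over the repeats is the usual way to reduce noise from other processes; it is not how %timeit reports its mean and standard deviation.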

This Q&A on the fastest way to compare a list of lists against a list of sets comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/66033921/
