Is there a faster way to perform the following list comprehension?
ret = [
    [
        subList
        for subList in lst
        if set(subList) not in listOfSets
    ]
    for lst in listOfLists
]
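Constraints
listOfSets: none.
listOfLists: the order of the elements inside each sublist must be preserved during construction, but the order of the lists themselves need not be. That is:
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] != [[[1, 2, 3], [6, 5, 4]], [[7, 8, 9]]]
but
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] = [[[7, 8, 9]], [[1, 2, 3], [4, 5, 6]]]
ret: ret must keep the same order as the original listOfLists, as described above.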
My code generates the following list of lists. Each list contains sublists of the same size, but the number of sublists varies. For example:
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]
I need to filter this list of lists, removing every sublist whose elements, taken as a set, appear in a list of sets:
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}, ...]
ret = [
    [
        subList
        for subList in lst
        if set(subList) not in listOfSets
    ]
    for lst in listOfLists
]
ret = [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]
Note that [1, 2, 3] is missing from ret.
I tried the following variant:
ret = [
    [
        subList
        for subList in lst
        if not(set(subList) in listOfSets)
    ] for lst in listOfLists
]
The idea was that not(set(subList) in listOfSets) would return faster because it only needs to find one match, but it made no difference:
%timeit ret = [
    [
        subList
        for subList in lst
        if set(subList) not in listOfSets
    ]
    for lst in listOfLists
]
772 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit ret = [
    [
        subList
        for subList in lst
        if not(set(subList) in listOfSets)
    ] for lst in listOfLists
]
797 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
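The near-identical timings are what one would expect: both spellings perform the same membership test, which already stops at the first match, and membership in a list compares against its elements one by one. A minimal sketch using the example data from the question (the names ret_not_in and ret_negated_in are introduced here only for illustration):

```python
# Example data taken from the question.
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# Spelling 1: the `not in` operator.
ret_not_in = [
    [subList for subList in lst if set(subList) not in listOfSets]
    for lst in listOfLists
]

# Spelling 2: negating an `in` test.
ret_negated_in = [
    [subList for subList in lst if not (set(subList) in listOfSets)]
    for lst in listOfLists
]

print(ret_not_in == ret_negated_in)  # True
print(ret_not_in)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```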
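Best answer
Original answer
Since the other answers are already quite complete, I won't expand much, but I got better performance by using set difference.
Let: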
>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
For comparison, here is JJ Hassan's second example run on my machine (note that I have included the not in from the original question):
>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(4, 6, 5)], []]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
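The example above assumes the data is already stored as tuples inside a set. As a hedged sketch of how the question's original listOfLists and listOfSets could get the same hashed lookup, the sets to exclude can be stored as frozenset objects, which are hashable. The name excluded is hypothetical. Unlike the tuple version, which matches only tuples with identical element order, this keeps the question's order-insensitive comparison and leaves each surviving sublist untouched:

```python
# Example data taken from the question.
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# frozenset is hashable, so the sets to exclude can be stored in a set.
# Membership then uses a hash lookup instead of scanning a list.
excluded = {frozenset(s) for s in listOfSets}

ret = [
    [subList for subList in lst if frozenset(subList) not in excluded]
    for lst in listOfLists
]

print(ret)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```

The conversion to excluded is done once, outside the comprehension, so its cost is not repeated for every sublist.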
Now, using set difference:
>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
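You may want to try this option as well, since I believe the difference in speed could become more noticeable with larger lists.
The idea behind this is that you have a superlist, which is a list of lists, each of which contains sublists (tuples, in this case). However, according to your requirements, the intermediate lists do not need to preserve order (only the superlist and the sublists do), and we want the elements that are not found in set_of_tuples. The intermediate lists can therefore be treated as sets, and taking the elements that do not belong to set_of_tuples is simply a set difference.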
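A small sketch of that reasoning, using the same sample data: the set difference keeps the same elements as the not in comprehension once order inside each intermediate list is ignored. One caveat worth stating as an assumption: treating an intermediate list as a set also collapses any duplicate tuples it contains. The names by_comprehension, by_difference, and with_duplicates are introduced here only for illustration.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Filter with a membership test, keeping list order.
by_comprehension = [
    [sl for sl in l if sl not in set_of_tuples]
    for l in list_of_lists_of_tuples
]

# Filter with set difference, giving one set per intermediate list.
by_difference = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

# Both keep the same elements when order is ignored.
same_elements = [set(a) == b for a, b in zip(by_comprehension, by_difference)]
print(same_elements)  # [True, True]

# Caveat: duplicates inside an intermediate list collapse into one element.
with_duplicates = [(4, 6, 5), (4, 6, 5), (1, 2, 3)]
print(set(with_duplicates) - set_of_tuples)  # {(4, 6, 5)}
```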
Edit
I just came up with a slightly faster solution using operator and itertools. However, this new solution is only better when there is enough data.
Let's start from the previous solution:
filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
Now, with a simple application of map, this becomes:
filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]
Then we can use operator.sub to rewrite it as:
from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]
Or, using a plain list() call around a generator expression:
from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))
Finally, we use map again, this time bringing itertools.repeat into play:
from itertools import repeat
from operator import sub
filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
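As a quick check that the rewriting steps above do not change the result, the sketch below evaluates each form on the sample data and compares them. The names step_1 through step_4 are introduced here only for illustration. Because map stops at its shortest input, the endless repeat(set_of_tuples) simply supplies the same set once per intermediate list.

```python
from itertools import repeat
from operator import sub

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Starting point: set difference inside a list comprehension.
step_1 = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

# Build the sets with map instead of calling set() in the comprehension.
step_2 = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

# Replace the `-` operator with operator.sub.
step_3 = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

# Apply sub through map; repeat() provides set_of_tuples for every call.
step_4 = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

print(step_1 == step_2 == step_3 == step_4)  # True
print(step_4)  # [{(4, 6, 5)}, set()]
```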
This new approach is actually the slowest on small lists:
>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
But now let's define larger lists. I roughly used the sizes you mentioned in the comments:
>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}
With this new data, this is what I get on my machine:
>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
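For anyone who wants to repeat this comparison outside IPython, here is a rough sketch using the standard timeit module. The wrapper functions, the fixed random seed, and the repeat counts are assumptions made for illustration. Wrapping each expression in a function adds a small call overhead, so absolute numbers will differ from the %timeit figures above and will vary between machines.

```python
import timeit
from itertools import repeat
from operator import sub
from random import randint, seed

seed(0)  # assumption: fixed seed so the generated set is reproducible
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}


def with_comprehension():
    return [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]


def with_difference():
    return [set(l) - set_of_tuples for l in list_of_lists_of_tuples]


def with_map_repeat():
    return list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))


calls_per_run = 200
timings = {}
for func in (with_comprehension, with_difference, with_map_repeat):
    # Take the best of several runs to reduce noise from other processes.
    runs = timeit.repeat(func, number=calls_per_run, repeat=5)
    timings[func.__name__] = min(runs) / calls_per_run
    print(f"{func.__name__}: {timings[func.__name__] * 1e6:.1f} µs per call")
```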
Source: "python - Fastest way to compare a list of lists against a list of sets", from a similar question on Stack Overflow: https://stackoverflow.com/questions/66033921/