python - How to efficiently find the first list that contains a concept in Python

Tags: python

I have the following 5 lists.

list1 = [[111, ["food", "fruits", "vegetables"]], [112, ["mango", "apples", "grapes", "pears", "passion fruit"]]]
list2 = [[110, ["transport", "car", "van", "bus", "jeep"]], [109, ["trams", "trains", "passenger", "driver"]], [108, ["traffic", "lights"]]]
list3 = [[111, ["book", "letters", "library", "reading"]], [112, ["education", "jobs", "companies", "salary"]]]
list4 = [[111, ["food", "curry", "spices", "rice", "fruits", "vegetables"]], [112, ["fruits", "vegetables", "farms", "farmers"]]]
list5 = [[111, ["food", "industry", "delivery"]], [112, ["fresh", "curry", "food", "pears", "passion fruit"]]]

I also have a list of concepts.

myconcepts = ["fruits", "curry"]

For each concept in myconcepts, I want to find the first list that contains it. That is:

"fruits" -> list1
"curry" -> list4

I am currently using the following code to do this.

mylists = [list1, list2, list3, list4, list5]
for concept in myconcepts:
    initial_year = ""  # stays empty until the concept is found
    counting = 1       # 1-based position of the list being checked

    for mylist in mylists:
        for item in mylist:
            # item is [id, [words...]]; item[1] is the word list
            if concept in item[1]:
                initial_year = str(counting)
                break

        if len(initial_year) > 0:
            break
        else:
            counting = counting + 1
    print(counting)

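This works fine for small datasets. However, my real dataset is huge: nearly 25 lists, each with close to 5 million records, and my concept list has about 15,000 entries. As a result, my code takes a long time to run. Is there a more efficient way to do this in Python?

I am happy to provide more details if needed.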

Best answer

Here is an approach that uses sets. Compared with searching a list, a set speeds up membership tests with in.

list1 = [[111, ["food", "fruits", "vegetables"]], [112, ["mango", "apples", "grapes", "pears", "passion fruit"]]]
list2 = [[110, ["transport", "car", "van", "bus", "jeep"]], [109, ["trams", "trains", "passenger", "driver"]], [108, ["traffic", "lights"]]]
list3 = [[111, ["book", "letters", "library", "reading"]], [112, ["education", "jobs", "companies", "salary"]]]
list4 = [[111, ["food", "curry", "spices", "rice", "fruits", "vegetables"]], [112, ["fruits", "vegetables", "farms", "farmers"]]]
list5 = [[111, ["food", "industry", "delivery"]], [112, ["fresh", "curry", "food", "pears", "passion fruit"]]]

myconcepts = ["fruits", "curry"]

# for each input list, convert every record's word list into a frozenset
flatsets = [[frozenset(l[1]) for l in lists] for lists in [list1, list2, list3, list4, list5]]

# return the index of the first list in which any record contains the concept
def get_idx(setlist, concept):
    for ix_f, fsets in enumerate(setlist):
        for fset in fsets:
            # set membership test: average O(1) instead of scanning a list
            if concept in fset:
                return ix_f
    return None

# generate a list holding the index of each concept (None if not found)
ix_concepts = [get_idx(flatsets, c) for c in myconcepts]

# show result    
listnames = ['list1', 'list2', 'list3', 'list4', 'list5']    
for i, c in enumerate(myconcepts):
    print(f"concept '{c}' found first in {listnames[ix_concepts[i]]}")
# concept 'fruits' found first in list1
# concept 'curry' found first in list4
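
To illustrate why the set conversion helps, here is a small sketch that times membership tests on a list and on a frozenset holding the same words. The word list and its size are made up for illustration and are not taken from the question; absolute timings will vary by machine.

```python
import timeit

# hypothetical data: 10,000 distinct words (not from the question)
words = [f"word{i}" for i in range(10_000)]
word_set = frozenset(words)

target = "word9999"  # last element: worst case for the list scan

# both containers give the same answer
in_list = target in words
in_set = target in word_set

# a list is scanned element by element; a set uses a hash lookup
t_list = timeit.timeit(lambda: target in words, number=1_000)
t_set = timeit.timeit(lambda: target in word_set, number=1_000)

print(f"list: {t_list:.4f}s, frozenset: {t_set:.4f}s")
```

The conversion itself costs one pass over the data, so it pays off only when each set is queried many times, as is the case with 15,000 concepts.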

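However, given the size of your data (15k * 25 * 5M), I don't think this is a 1:1 solution to your actual problem. As already mentioned here, substantial data preparation would be needed. Also, I suspect that the current O(N²) search (ignoring the time needed to flatten the lists and so on) will still take a long time.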
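
One way to avoid rescanning the data for every concept is to walk the lists once, in order, and record the first list in which each wanted concept appears. The following is a minimal sketch of that idea and is not part of the original answer; first_list_index is a hypothetical helper name, and it assumes every record has the shape [id, [words...]] as in the question.

```python
list1 = [[111, ["food", "fruits", "vegetables"]], [112, ["mango", "apples", "grapes", "pears", "passion fruit"]]]
list2 = [[110, ["transport", "car", "van", "bus", "jeep"]], [109, ["trams", "trains", "passenger", "driver"]], [108, ["traffic", "lights"]]]
list3 = [[111, ["book", "letters", "library", "reading"]], [112, ["education", "jobs", "companies", "salary"]]]
list4 = [[111, ["food", "curry", "spices", "rice", "fruits", "vegetables"]], [112, ["fruits", "vegetables", "farms", "farmers"]]]
list5 = [[111, ["food", "industry", "delivery"]], [112, ["fresh", "curry", "food", "pears", "passion fruit"]]]


def first_list_index(mylists, concepts):
    """Map each concept to the 0-based index of the first list containing it.

    Hypothetical helper: assumes each record is [id, [words...]].
    """
    remaining = set(concepts)  # concepts not located yet
    first_seen = {}            # concept -> index of the first list

    for idx, records in enumerate(mylists):
        for _record_id, words in records:
            # keep only words that are wanted and still unresolved
            for word in remaining.intersection(words):
                first_seen[word] = idx
            remaining.difference_update(words)
        if not remaining:      # every concept resolved: stop early
            break
    return first_seen


mylists = [list1, list2, list3, list4, list5]
result = first_list_index(mylists, ["fruits", "curry"])
for concept, idx in result.items():
    print(f"{concept} -> list{idx + 1}")
# fruits -> list1
# curry -> list4
```

Each word in the data is visited at most once, so the work grows with the total number of words rather than with the number of concepts times the data size. Concepts that never occur are simply absent from the returned dict.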

Regarding "python - How to efficiently find the first list that contains a concept in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58891155/
