python - How to build a custom dictionary from a list of string lists faster

Background

I want to create a dictionary in which every word has a unique id, for use with word embeddings. The dataset looks like this:

s_lists = [['I', 'want', 'to', 'go', 'to', 'the', 'park'],
           ['I', 'want', 'to', 'quit', 'the', 'team']]

The following function builds the dictionary:

def build_dict(input_list, start=2):
    """
    build dictionary
    ids start at 2; 1 is reserved for unknown words, 0 for zero padding

    :param input_list:
    :param start:
    :return: custom dictionary
    """

    whole_set = set()
    for current_sub_list in input_list:
        # remove duplicate elements
        current_set = set(current_sub_list)
        # add new element into whole set
        whole_set = whole_set | current_set
    return {ni: indi + start for indi, ni in enumerate(whole_set)}

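Running it on s_lists produces output like the following (the exact ids depend on set iteration order and can vary between runs):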

{'I': 7, 'go': 2, 'park': 4, 'quit': 8, 'team': 6, 'the': 5, 'to': 9, 'want': 3}
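
As a rough illustration of how such a dictionary is typically used, the sketch below encodes a sentence into ids, falling back to 1 for unknown words and padding with 0, as the docstring describes. The helper name `encode`, the `max_len` parameter, and the small hand-written `word_to_id` mapping are assumptions made for this example; they are not part of the original question.

```python
PAD_ID = 0      # reserved for zero padding (per the docstring above)
UNKNOWN_ID = 1  # reserved for unknown words (per the docstring above)


def encode(words, word_to_id, max_len):
    """Map words to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [word_to_id.get(w, UNKNOWN_ID) for w in words[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical mapping with the same shape that build_dict returns.
word_to_id = {'I': 2, 'want': 3, 'to': 4, 'go': 5}

# 'fly' is not in the mapping, so it becomes UNKNOWN_ID.
print(encode(['I', 'want', 'to', 'fly'], word_to_id, max_len=6))
# prints [2, 3, 4, 1, 0, 0]
```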

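Problem

When I use it on a large dataset (about 500,000 strings), it takes about 30 s (ENV mbpr15-i7). That is too slow. I am looking for a way to improve the performance, but so far I have no idea how.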

Best answer

Try the following code, which uses itertools.chain. In my test case it ran about 4 times faster:

from itertools import chain

start = 2
{it: n + start for n, it in enumerate(set(chain(*s_lists)))}
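
A likely reason the original function is slow is that `whole_set = whole_set | current_set` builds a brand-new set on every iteration, copying everything collected so far. The sketch below wraps the one-liner above in a function and uses `chain.from_iterable`, so the outer list does not have to be unpacked into call arguments. The name `build_dict_fast` is an assumption for illustration, and the actual speedup will depend on the data.

```python
from itertools import chain


def build_dict_fast(input_list, start=2):
    """Assign each distinct word an id, beginning at `start`."""
    # chain.from_iterable walks the sub-lists lazily;
    # set() then deduplicates all words in a single pass.
    unique_words = set(chain.from_iterable(input_list))
    return {word: i + start for i, word in enumerate(unique_words)}


s_lists = [['I', 'want', 'to', 'go', 'to', 'the', 'park'],
           ['I', 'want', 'to', 'quit', 'the', 'team']]

word_to_id = build_dict_fast(s_lists)
print(len(word_to_id))  # prints 8 (number of distinct words)
```

As with the original function, the ids follow set iteration order, so the same word can receive a different id from one run to the next. If stable ids matter, one option is to replace the `set(...)` call with `dict.fromkeys(chain.from_iterable(input_list))`, which keeps words in first-seen order on Python 3.7 and later.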

Source: a similar question on Stack Overflow about python - how to build a custom dictionary from a list of string lists faster: https://stackoverflow.com/questions/50941661/