python - Optimization: finding the best way to compare two lists of dicts (Python)

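I need to optimize a function written in Python that compares two lists of dicts and returns the differences as lists.

Explanation:

I have two input lists. Each list contains a series of dicts in the following format: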

{
    'a': 'foo',
    'b': 'bar',
    'switch': True
}

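First, I have to check whether each (a, b) pair found in the second list exists in the first list; if it does not, I append the new pair to a list called added. Likewise, I have to check whether each (a, b) pair found in the first list exists in the second list; if it does not, I append the removed pair to a list called delisted.

Then, for the pairs that exist in both lists, I have to check whether the switch key has the same value. If it does not, I have to add the element to the switched list.

Example:

To sum this up, here is an example: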

# First list in input
first = [
    {
        'a': 'foo',
        'b': 'bar',
        'switch': False
    },{
        'a': 'I_will',
        'b': 'be_delisted',
        'switch': True
    },{
        'a': 'I_will',
        'b': 'be_switched',
        'switch': True
    }
]

# Second list to compare
second = [
    {
        'a': 'foo',
        'b': 'bar',
        'switch': False
    },{
        'a': 'I_am',
        'b': 'new',
        'switch': True
    },{
        'a': 'I_will',
        'b': 'be_switched',
        'switch': False # switched
    }
]

diff = my_diff(first, second)

Expected output:

{
    'added': [{
        'a': 'I_am',
        'b': 'new',
        'switch': True
    }],
    'delisted': [{
        'a': 'I_will',
        'b': 'be_delisted',
        'switch': True
    }],
    'switched': [{
        'a': 'I_will',
        'b': 'be_switched',
        'switch': False
    }]
}

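So there are two distinct comparisons:

A comparison of the elements between the lists
A comparison of the content of the elements that exist in both lists

Existing code:

For the first comparison between the lists, I use the hash function to hash each (a, b) pair. I then store that hash, together with the index of the element, in a first_hash list and a second_hash list.

Like this: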

# (hash of the (a, b) pair, index of the element) for every element
first_hash = [(hash((d['a'], d['b'])), i) for i, d in enumerate(first)]
second_hash = [(hash((d['a'], d['b'])), i) for i, d in enumerate(second)]

From these I get my added and delisted lists:

added = [ second[ e[1] ] for e in second_hash if e[0] not in (fh[0] for fh in first_hash) ]
delisted = [ first[ e[1] ] for e in first_hash if e[0] not in (sh[0] for sh in second_hash) ]
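
Each membership test above rebuilds and scans a generator over the other list, so the cost grows with len(first) * len(second). A minimal sketch of the same step using sets of (a, b) tuples, assuming the values of a and b are hashable; pair_key and added_and_delisted are hypothetical names introduced here, not part of the original code:

```python
def pair_key(d):
    # Hypothetical helper: the (a, b) pair that identifies an element.
    return (d['a'], d['b'])


def added_and_delisted(first, second):
    # Build each key set once; a membership test on a set is O(1) on average.
    first_keys = {pair_key(d) for d in first}
    second_keys = {pair_key(d) for d in second}
    added = [d for d in second if pair_key(d) not in first_keys]
    delisted = [d for d in first if pair_key(d) not in second_keys]
    return added, delisted


first = [{'a': 'foo', 'b': 'bar', 'switch': False},
         {'a': 'I_will', 'b': 'be_delisted', 'switch': True}]
second = [{'a': 'foo', 'b': 'bar', 'switch': False},
          {'a': 'I_am', 'b': 'new', 'switch': True}]

added, delisted = added_and_delisted(first, second)
print(added)     # [{'a': 'I_am', 'b': 'new', 'switch': True}]
print(delisted)  # [{'a': 'I_will', 'b': 'be_delisted', 'switch': True}]
```

Storing the tuples themselves rather than their hash() values also means two different pairs can never be mistaken for each other.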

I get the elements that exist in both lists, then put them into dictionaries keyed by the hash so that they can be compared:

sames_first = [ (e[0], first[ e[1] ]) for e in first_hash if e[0] in (sh[0] for sh in second_hash) ] # Elements of first that also exist in second
sames_second = [ (e[0], second[ e[1] ]) for e in second_hash if e[0] in (fh[0] for fh in first_hash) ] # Elements of second that also exist in first

sfirst = {}
ssecond = {}

for sf in sames_first:
    sfirst[sf[0]] = sf[1]

for ss in sames_second:
    ssecond[ss[0]] = ss[1]

Then I compare them and get the switched list:

switched = [ssecond[e] for e in ssecond.keys() if ssecond[e]['switch'] != sfirst[e]['switch']]

I append ssecond[e] (the element from the second list) so that the result holds the new value.
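
The switched step could also be sketched with a single dict keyed by the (a, b) tuple itself instead of two dicts keyed by hash(...). This is only an illustration, assuming each (a, b) pair appears at most once per list; switched_elements is a hypothetical name:

```python
def switched_elements(first, second):
    # Index the first list by its (a, b) tuple. A dict compares keys for
    # equality as well as by hash, so colliding hashes cannot mix up pairs.
    first_by_key = {(d['a'], d['b']): d for d in first}
    switched = []
    for d in second:
        old = first_by_key.get((d['a'], d['b']))
        # Keep the element from the second list so the result holds the new value.
        if old is not None and old['switch'] != d['switch']:
            switched.append(d)
    return switched


first = [{'a': 'foo', 'b': 'bar', 'switch': False},
         {'a': 'I_will', 'b': 'be_switched', 'switch': True}]
second = [{'a': 'foo', 'b': 'bar', 'switch': False},
          {'a': 'I_will', 'b': 'be_switched', 'switch': False}]

print(switched_elements(first, second))
# [{'a': 'I_will', 'b': 'be_switched', 'switch': False}]
```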

Full code:

To test locally with the tester on pastebin: Pastebin
To test directly online: Online testing

Currently I get:

1.92713737488 ms for 100 element
162.150144577 ms for 1000 element
15205.0578594 ms for 10000 element

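My question is: is there a more efficient way to perform this task on large datasets? (For example, mapping the objects or their indexes to one of their attributes and comparing them directly?)

Thanks to anyone who takes a little time to read this and tries to answer my request :)

Best answer

You can keep the output format in a dictionary. Using list comprehensions, you can get the desired output in a more reasonable time.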

    [res['switched'].append(i) if switchDict(i) in first else res['added'].append(i) if i not in first  else None for i in second ]

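The line above fills the switched key of your res dict (if the element is found in first with its switch value flipped) and the added key (if the element is not present in first).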

res['delisted']=[i for i in first if i not in second and switchDict(i) not in res['switched']]

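Similarly, the delisted key of res is filled by iterating over the first list and keeping the elements that are not present in the second list and whose flipped version is not in switched.

Edit: checking switchDict(i) not in res['switched'] instead of switchDict(i) not in second in the snippet above reduced the execution time by about 500 ms for 10000 elements!

Hence,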

def switchDict(d):
    # Return a copy of d with its 'switch' flag inverted.
    return {'a': d['a'], 'b': d['b'], 'switch': not d['switch']}

def my_diff(first, second):
    # One separate list per key: dict.fromkeys(keys, []) would make all
    # three keys share the same list object.
    res = {'added': [], 'switched': [], 'delisted': []}
    unchanged = []  # elements of second that are present, unchanged, in first
    for i in second:
        if switchDict(i) in first:
            res['switched'].append(i)
        elif i not in first:
            res['added'].append(i)
        else:
            unchanged.append(i)
    # Only the unchanged elements need to be scanned here instead of all of
    # second, which reduced the execution time by another 1000 ms (approx.)
    res['delisted'] = [i for i in first if i not in unchanged and switchDict(i) not in res['switched']]
    return res

will give you the appropriate result:

0.0457763671875 ms for 10 element
1.32894515991 ms for 100 element
64.845085144 ms for 1000 element
6941.58291817 ms for 10000 element

(The timings here depend on the random input generated by the Python file you shared!)
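
The timings above still grow roughly quadratically, because every `in` test on a list of dicts scans that list. As a possible alternative (not the method used in this answer), here is a hedged sketch that indexes the lists by their (a, b) pair and makes one pass over each. It assumes the values of a and b are hashable and that each pair appears at most once per list; my_diff_indexed is a hypothetical name:

```python
def my_diff_indexed(first, second):
    # Index first by (a, b) and collect the keys of second once.
    first_by_key = {(d['a'], d['b']): d for d in first}
    second_keys = {(d['a'], d['b']) for d in second}

    res = {'added': [], 'delisted': [], 'switched': []}
    for d in second:
        old = first_by_key.get((d['a'], d['b']))
        if old is None:
            res['added'].append(d)        # pair only exists in second
        elif old['switch'] != d['switch']:
            res['switched'].append(d)     # same pair, different switch value
    # Pairs of first that no longer appear in second.
    res['delisted'] = [d for d in first if (d['a'], d['b']) not in second_keys]
    return res


first = [
    {'a': 'foo', 'b': 'bar', 'switch': False},
    {'a': 'I_will', 'b': 'be_delisted', 'switch': True},
    {'a': 'I_will', 'b': 'be_switched', 'switch': True},
]
second = [
    {'a': 'foo', 'b': 'bar', 'switch': False},
    {'a': 'I_am', 'b': 'new', 'switch': True},
    {'a': 'I_will', 'b': 'be_switched', 'switch': False},
]

diff = my_diff_indexed(first, second)
print(diff['added'])     # [{'a': 'I_am', 'b': 'new', 'switch': True}]
print(diff['delisted'])  # [{'a': 'I_will', 'b': 'be_delisted', 'switch': True}]
print(diff['switched'])  # [{'a': 'I_will', 'b': 'be_switched', 'switch': False}]
```

Since each dict or set lookup is O(1) on average, the work should scale about linearly with the total number of elements; no timings are claimed here.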

Hope this helps!

Regarding python - Optimization: finding the best way to compare two lists of dicts (Python), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42996426/
