Python, nested loops, matching, and performance

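I am trying to match a list of last names against a list of full names using Python 2.7 and the Levenshtein function. To reduce the workload, I only compare two words when their first letters are the same (although this does not seem to make much difference to performance). If a match is found, the matching word is removed from the full name (to make the later first-name matching easier). Both lists contain several tens of thousands of entries, so my solution is rather slow. How can I speed it up without parsing the full names? This is what I have so far (I have omitted a few if conditions that handle last names made up of several words):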

import Levenshtein

listoflastnames = ['Jones', 'Sallah']
listoffullnames = [['Henry', 'Jones', 'Junior'], ['Indiana', 'Jones']]


def match_strings(lastname, listofnames):
    match = 0
    matchedidx = []
    for index, nameelement in enumerate(listofnames):
        # Only compute the edit distance when the first letters agree.
        if lastname[0] == nameelement[0]:
            if Levenshtein.distance(nameelement, lastname) < 2:
                matchedidx.append(index)
                match = match + 1
    if match == 1:
        # Exactly one word matched: return the full name without that word.
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
        return 1, newnamelist
    return 0, listofnames


for x in listoflastnames:
    for y in listoffullnames:
        match, newlistofnames = match_strings(x, y)
        if match == 1:
            pass  # go to first name match...

Any help would be appreciated!

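Update: In the meantime I have used the multiprocessing module so that all 4 of my cores work on the problem instead of just one, but the matching still takes a long time.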

Best answer

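This simplifies the for loop in the match_strings function, but in my tests it did not improve speed significantly. The biggest cost is in the two for loops over the last names and the full names.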

def match_strings(lastname, listofnames):
    # Indices refer to positions in listofnames, so they can be used to filter it below.
    matchedidx = [index for index, name in enumerate(listofnames)
                  if lastname[0] == name[0]
                  and Levenshtein.distance(lastname, name) < 2]
    match = len(matchedidx)
    if match == 1:
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
        return 1, newnamelist
    return 0, listofnames

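You may want to sort the list of known last names and split them into a dict keyed by starting character. Then match each word of each full name only against the last names stored under its first character.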
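The dict idea could be sketched roughly as follows. This is only an illustration under some assumptions: `build_index`, `find_lastname`, and `max_distance` are hypothetical names that do not appear in the question, and `edit_distance` is a plain-Python stand-in so the snippet runs without the Levenshtein package (in practice `Levenshtein.distance` would most likely be faster).

```python
from collections import defaultdict


def edit_distance(a, b):
    """Plain dynamic-programming edit distance (stand-in for Levenshtein.distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_index(lastnames):
    """Group last names by their first character (hypothetical helper)."""
    index = defaultdict(list)
    for lastname in lastnames:
        if lastname:
            index[lastname[0]].append(lastname)
    return index


def find_lastname(fullname, index, max_distance=1):
    """Return (position, lastname) pairs for words close to a known last name."""
    hits = []
    for position, word in enumerate(fullname):
        if not word:
            continue
        # Only last names sharing the first character are ever compared.
        for lastname in index.get(word[0], ()):
            if edit_distance(word, lastname) <= max_distance:
                hits.append((position, lastname))
    return hits


listoflastnames = ['Jones', 'Sallah']
listoffullnames = [['Henry', 'Jones', 'Junior'], ['Indiana', 'Jones']]

index = build_index(listoflastnames)
results = [find_lastname(fullname, index) for fullname in listoffullnames]
print(results)
```

With this layout the outer loop runs over the full names only, and each word is compared against one bucket of last names rather than the whole list. How much that saves depends on how evenly the last names are spread over starting letters.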

Assuming each full-name list always has the first name as its first element, you can restrict the comparison to the remaining elements.
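A minimal sketch of that restriction, assuming the first element really is always the first name. The function name `match_after_first` and the `distance` parameter are made up for illustration; in real use one would presumably pass `Levenshtein.distance`. The `exact_only` metric is just a placeholder so the example runs without that package.

```python
def match_after_first(lastname, fullname, distance, max_distance=1):
    """Variant of match_strings that never compares against fullname[0]."""
    matchedidx = [index for index, name in enumerate(fullname)
                  if index > 0                      # skip the assumed first name
                  and name and lastname[0] == name[0]
                  and distance(lastname, name) <= max_distance]
    if len(matchedidx) == 1:
        remaining = [name for index, name in enumerate(fullname)
                     if index != matchedidx[0]]
        return 1, remaining
    return 0, fullname


def exact_only(a, b):
    # Placeholder metric for the demo: 0 for identical strings, a large value otherwise.
    return 0 if a == b else 99


print(match_after_first('Jones', ['Jones', 'Jones'], exact_only))
print(match_after_first('Jones', ['Indiana', 'Jones'], exact_only))
```

Besides saving one comparison per full name, skipping the first element changes the result when a first name happens to equal a last name: in the first call only the second 'Jones' is considered, so exactly one match is reported.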

Regarding Python, nested loops, matching, and performance, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20460681/
