python - Improving my code to group identical words in a large Python list, and comparing it with other code

Tags: python

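I have been reading some other posts about grouping similar words (What is a good strategy to group similar words?, Fuzzy Group By, Grouping Similar Words). I am curious (1) whether someone can walk me through how one of the algorithms I found in the second link works, and (2) how its programming style compares with my own "naive" approach.

If you can answer only 1 or 2, I will still upvote you.

(1) Can someone help me understand what is going on here?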

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()
    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

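def group_similar(words):
    seeder = Seeder()
    # sort by seed first so that groupby sees each seed's words as one run
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    # without a return the computed groups would be discarded
    return [list(v) for k, v in groups]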

(2) In my approach, I have a list of strings to group, called "residencyList", and I use a defaultdict.

Array(['Psychiatry', 'Radiology Medicine-Prelim',
       'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
       'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
       'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
       'Medicine/Pediatrics',])

Here is my attempt at grouping. I based it on uniqueResList, which is np.unique(residencyList).

d = collections.defaultdict(int)
for i in residencyList:
    for x in uniqueResList:
        if x ==  i:
            if not d[x]:
                #print i, x
                d[x] = i  
                #print d
            if d[x]:
                d[x] = d.get(x, ()) + ', ' + i
        else:
            #print 'no match'
            continue
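
For comparison with the loop above: because it only matches strings that are exactly equal (`x == i`), the same grouping can probably be sketched with the standard library alone. This is a minimal sketch, not the asker's code. `residency_list` is a hypothetical stand-in for `residencyList` filled with the sample values shown earlier, and `group_identical` is an illustrative name.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for residencyList, using the sample values from the question.
residency_list = [
    'Psychiatry', 'Radiology Medicine-Prelim', 'Radiology Medicine-Prelim',
    'Medicine', 'Medicine', 'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
    'Orthopaedic Surgery', 'Surgery', 'Pediatrics', 'Medicine/Pediatrics',
]


def group_identical(words):
    """Map each distinct string to the list of its exact occurrences."""
    groups = defaultdict(list)
    for word in words:
        groups[word].append(word)  # a missing key starts as an empty list
    return dict(groups)


groups = group_identical(residency_list)
counts = Counter(residency_list)  # only the number of occurrences per string
```

Exact matching keeps 'Obstetrics/Gynecology' and the misspelled 'Obstetrics/Gyncology' in separate groups, which is the case the edit-distance approach is meant to handle.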

Best answer

A short explanation of the "ninja math" in distance:

    # This is just the edit distance (Levenshtein distance) between the two words.
    def distance(self, s1, s2):
        l1 = len(s1)  # length of the first word
        l2 = len(s2)  # length of the second word
        # Make an (l2 + 1) by (l1 + 1) matrix whose first row and first column
        # count up from 0 to l1 and l2 respectively (these are the costs of
        # deleting the letters that came before that element in each word).
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    # The two letters are the same, so we don't have to change
                    # them; take the cheapest path from the options of
                    #   matrix[zz+1][sz] + 1  (delete the letter in s1)
                    #   matrix[zz][sz+1] + 1  (delete the letter in s2)
                    #   matrix[zz][sz]        (leave both letters)
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    # The two letters are not the same, so we have to change
                    # them; take the cheapest path from the options of
                    #   matrix[zz+1][sz] + 1  (delete the letter in s1)
                    #   matrix[zz][sz+1] + 1  (delete the letter in s2)
                    #   matrix[zz][sz] + 1    (substitute one letter for the other)
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        # The value in the bottom-right corner of the matrix is the cost of
        # the cheapest set of edits.
        return matrix[l2][l1]
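
To try the function outside the class, here is a minimal Python 3 sketch of the same dynamic-programming table. It is a port under stated assumptions, not the original answer's code: `xrange` becomes `range`, each row is wrapped in `list(...)` so its cells can be assigned to, and the names `levenshtein` and `group_greedy` are illustrative. `group_greedy` roughly mirrors what `Seeder.get_seed` does, without the cache and with seeds tried in insertion order rather than set order.

```python
def levenshtein(s1, s2):
    """Edit distance between s1 and s2 (insert, delete, substitute cost 1 each)."""
    l1, l2 = len(s1), len(s2)
    # Row zz starts as [zz, zz+1, ..., zz+l1]; only row 0 and column 0 keep
    # these values, every other cell is overwritten in the loops below.
    matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
    for zz in range(l2):
        for sz in range(l1):
            # The diagonal step is free when the letters already match.
            cost = 0 if s1[sz] == s2[zz] else 1
            matrix[zz + 1][sz + 1] = min(
                matrix[zz + 1][sz] + 1,  # delete the letter in s1
                matrix[zz][sz + 1] + 1,  # delete the letter in s2
                matrix[zz][sz] + cost,   # keep or substitute
            )
    return matrix[l2][l1]


def group_greedy(words, limit=2):
    """Assign each word to the first seed within `limit` edits (illustrative helper)."""
    groups = {}  # seed word -> words assigned to it, in insertion order
    for word in words:
        for seed in groups:
            if levenshtein(seed, word) <= limit:
                groups[seed].append(word)
                break
        else:
            groups[word] = [word]  # no seed is close enough: word becomes a seed
    return groups


print(levenshtein('kitten', 'sitting'))                              # 3
print(levenshtein('Obstetrics/Gynecology', 'Obstetrics/Gyncology'))  # 1
```

The grouping is greedy, so the result depends on the order in which the words are seen: the first seed within the limit wins, even if a later seed would be closer.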

Regarding "python - Improving my code to group identical words in a large Python list, and comparing it with other code", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22257388/
