python - Computing edit distance across a Python list

Tags: python levenshtein-distance edit-distance

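I have a list of strings, and I want to filter out strings that are too similar, based on Levenshtein distance. So if lev(list[0], list[10]) < 50, then del list[10]. Is there a more efficient way to compute the distance between every pair of strings in the list? Thanks!!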

data2= []
for i in data:
    for index, j in enumerate(data):
        s = levenshtein(i, j)
        if s < 50:
            del data[index]
    data2.append(i)

The rather stupid code above takes far too long to run...

Best answer

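What if we only record the indices of the strings that hit and skip them later? I am ignoring the cost of enumerate() and del(), and what the hit percentage is (i.e. how many strings have to be removed from the dataset).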

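THRESHOLD = 50
data = ["hel", "how", "are", "you"]  # replace with your dataset

# Assumes a levenshtein(a, b) function is already defined or imported.
tbr = set()  # indices of the strings to be removed
for idx, i in enumerate(data):
    for j in range(len(data)):  # use xrange on Python 2
        # Note: each pair is compared twice, so both members of a
        # too-similar pair end up marked for removal.
        if j != idx and levenshtein(i, data[j]) < THRESHOLD:
            tbr.add(j)

# print(tbr)
# Keep only the strings whose index was not marked.
data2 = [d for idx, d in enumerate(data) if idx not in tbr]
# print(data2)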
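
As a rough illustration of the "keep one, drop the near-duplicates" behaviour the question describes, here is a minimal sketch under a few assumptions. The function names `levenshtein` and `filter_similar` are hypothetical and not taken from any library, and the distance function is a plain dynamic-programming version written out only so the example runs on its own. Unlike the loop above, it compares each string only against the strings already kept, and it skips the distance computation whenever the length difference alone already reaches the threshold, since the edit distance can never be smaller than that difference.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept to two rows."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]


def filter_similar(strings, threshold):
    """Greedily keep a string only if it is at least `threshold` edits
    away from every string kept so far."""
    kept = []
    for s in strings:
        for k in kept:
            # The distance is never smaller than the length difference,
            # so this pair cannot be too similar; skip the expensive call.
            if abs(len(s) - len(k)) >= threshold:
                continue
            if levenshtein(s, k) < threshold:
                break  # too close to a kept string: drop s
        else:
            kept.append(s)  # no kept string was too close
    return kept


data = ["kitten", "sitting", "mitten", "flaw", "lawn"]
print(filter_similar(data, 3))  # ['kitten', 'sitting', 'flaw']
```

The worst case is still quadratic in the number of strings, so this mainly helps when many strings are dropped or when lengths vary a lot. The result also depends on input order, because the first string of each similar group is the one that survives.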

Regarding python - Computing edit distance across a Python list, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29447642/
