python - Calculating Levenshtein distance permitting QWERTY errors in R

Tags: python r levenshtein-distance qwerty


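I am computing the Levenshtein distance in R between user-entered company names and the Fortune 1000 list, but I want to allow for QWERTY typing errors. For example, the distance between Mcdimldes and McDonalds should be 2, because i is right next to o and m is right next to n.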

There is another attempt at implementing this, but it is in Python. Any help is much appreciated.

Let me know if I need to add further details to clarify the question.

Best answer

Perhaps you can build something on top of this:

## from the link in the linked python answer:
# txt <- "'q': {'x':0, 'y':0}, 'w': {'x':1, 'y':0}, 'e': {'x':2, 'y':0}, 'r': {'x':3, 'y':0}, 't': {'x':4, 'y':0}, 'y': {'x':5, 'y':0}, 'u': {'x':6, 'y':0}, 'i': {'x':7, 'y':0}, 'o': {'x':8, 'y':0}, 'p': {'x':9, 'y':0}, 'a': {'x':0, 'y':1},'z': {'x':0, 'y':2},'s': {'x':1, 'y':1},'x': {'x':1, 'y':2},'d': {'x':2, 'y':1},'c': {'x':2, 'y':2}, 'f': {'x':3, 'y':1}, 'b': {'x':4, 'y':2}, 'm': {'x':5, 'y':2}, 'j': {'x':6, 'y':1}, 'g': {'x':4, 'y':1}, 'h': {'x':5, 'y':1}, 'j': {'x':6, 'y':1}, 'k': {'x':7, 'y':1}, 'l': {'x':8, 'y':1}, 'v': {'x':3, 'y':2}, 'n': {'x':5, 'y':2}"
# txt <- strsplit(txt, "\\},\\s?")[[1]]
# m <- t(sapply(regmatches(txt, regexec("'(.)':\\s*\\{'x':(\\d+),\\s*'y':(\\d+).*", txt)), "[", -1))
# m <- matrix(as.numeric(m[,-1]), ncol=2, dimnames = list(m[,1],c("x","y")))
# dput(m)
m <- structure(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 1, 1, 2, 2, 3, 
  4, 5, 6, 4, 5, 6, 7, 8, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
  2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2), .Dim = c(27L, 
  2L), .Dimnames = list(c("q", "w", "e", "r", "t", "y", "u", "i", 
  "o", "p", "a", "z", "s", "x", "d", "c", "f", "b", "m", "j", "g", 
  "h", "j", "k", "l", "v", "n"), c("x", "y")))
m["m", ] <- c(6,2) # the source lists m at (5,2), the same cell as n, which seems wrong

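# Replace each character of `a` with the character at the same position in `b`
# when the two keys are neighbours (x and y each differ by at most 1).
# Assumes equal-length strings made up only of the letters in `m`.
f <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  posis <- lapply(strsplit(c(a, b), "", fixed = TRUE), function(x) m[x, , drop = FALSE])
  d <- abs(posis[[1]] - posis[[2]])
  idx <- which(rowSums(d > 1) == 0)
  if (length(idx) > 0) rownames(posis[[1]])[idx] <- rownames(posis[[2]])[idx]
  paste(rownames(posis[[1]]), collapse = "")
}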
a <- tolower("Mcdimldes") # make it case-insensitive
b <- tolower("McDonalds")
adist(a,b) # regular distance
#      [,1]
# [1,]    4
newa <- f(a, b) # replace possible typo chars
adist(newa,b) # new dist is 2 - as requested
#      [,1]
# [1,]    2
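
The `f` helper above compares characters position by position, so it only applies to strings of equal length. One way to lift that restriction, sketched here as an assumption rather than as part of the original answer, is to fold the adjacency rule into the edit-distance recursion itself: insertions and deletions cost 1, and a substitution costs 0 when the two keys are neighbours on the grid. The names `key_pos` and `qwerty_dist` are made up for this illustration, and the grid is the same unstaggered approximation of the keyboard used above.

```r
# Key coordinates built from the three letter rows (same grid as `m` above).
rows <- strsplit(c("qwertyuiop", "asdfghjkl", "zxcvbnm"), "", fixed = TRUE)
key_pos <- cbind(x = unlist(lapply(rows, seq_along)) - 1,
                 y = rep(seq_along(rows), lengths(rows)) - 1)
rownames(key_pos) <- unlist(rows)

# Levenshtein distance in which substituting a neighbouring key is free.
# Characters that are not on the grid (digits, spaces, ...) always cost 1.
qwerty_dist <- function(a, b, pos = key_pos) {
  a <- strsplit(tolower(a), "", fixed = TRUE)[[1]]
  b <- strsplit(tolower(b), "", fixed = TRUE)[[1]]
  sub_cost <- function(p, q) {
    if (p == q) return(0)
    if (!(p %in% rownames(pos)) || !(q %in% rownames(pos))) return(1)
    if (all(abs(pos[p, ] - pos[q, ]) <= 1)) 0 else 1
  }
  n <- length(a)
  k <- length(b)
  D <- matrix(0, n + 1, k + 1)
  D[, 1] <- 0:n  # delete every character of a
  D[1, ] <- 0:k  # insert every character of b
  for (i in seq_len(n)) {
    for (j in seq_len(k)) {
      D[i + 1, j + 1] <- min(D[i, j + 1] + 1,                  # deletion
                             D[i + 1, j] + 1,                  # insertion
                             D[i, j] + sub_cost(a[i], b[j]))   # substitution
    }
  }
  D[n + 1, k + 1]
}

qwerty_dist("Mcdimldes", "McDonalds")  # 2
```

Because neighbouring substitutions are free, two different strings can end up at distance 0 (for example "q" and "w"). If that is too permissive, returning a small positive cost such as 0.5 from `sub_cost` keeps such pairs distinguishable.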

The keyboard layout in the matrix:

keyb <- sweep(m, 2, c(1, -1), "*")
plot(keyb, type = "n")
text(keyb, rownames(keyb))
grid()
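
The layout can also be checked without a plot by listing which keys count as neighbours under the rule used in `f` (x and y each differ by at most 1). The following is a minimal sketch; `key_pos` and `neighbours` are hypothetical names, and the grid is rebuilt here so that the snippet runs on its own.

```r
# Rebuild the key grid from the three letter rows.
rows <- strsplit(c("qwertyuiop", "asdfghjkl", "zxcvbnm"), "", fixed = TRUE)
key_pos <- cbind(x = unlist(lapply(rows, seq_along)) - 1,
                 y = rep(seq_along(rows), lengths(rows)) - 1)
rownames(key_pos) <- unlist(rows)

# All keys within one step of `key` in both x and y, excluding the key itself.
neighbours <- function(key, pos = key_pos) {
  d <- abs(sweep(pos, 2, pos[key, ]))
  setdiff(rownames(pos)[rowSums(d > 1) == 0], key)
}

neighbours("o")  # "i" "p" "k" "l"
neighbours("m")  # "h" "j" "k" "n"
```

Because the grid ignores the stagger between keyboard rows, some pairs count as neighbours that are not quite adjacent on a physical keyboard, such as m and h.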


Regarding "python - Calculating Levenshtein distance permitting QWERTY errors in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43946912/
