python - How to find the similarity between two rows of data

Tags: python r algorithm similarity cosine-similarity

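I am trying to programmatically remove one record from each pair of near-duplicate records in a dataset. Logically, my dataset looks like the table below. As you can see, it contains two rows that a person can easily recognize as related and probably added by the same person.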

[Image: sample table of the dataset]

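My approach is to compare the fields (name, address, phone number) separately with Levenshtein and compute a similarity ratio for each. I then average the ratios, which gives 0.77873. That similarity seems rather low for these two rows. My Python code looks like this: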

from Levenshtein import ratio

# Per-field similarity ratios, each in [0, 1]
name     = ratio("Game of ThOnes Books for selling", "Selling Game of Thrones books")
address  = ratio("George Washington street", "George Washington st.")
phone    = ratio("555-55-55", "0(555)-55-55")

total_ratio = name + address + phone
print(total_ratio / 3)  # average ratio
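The field-by-field comparison described above can be wrapped in a small reusable function. The sketch below is only an illustration: it uses `difflib.SequenceMatcher` from the standard library in place of `Levenshtein.ratio`, so the numbers will not match the 0.77873 reported in the question. The helper names `field_ratio` and `row_similarity` are hypothetical, lowercasing before comparison is an added assumption, and the third row is borrowed from the sample data in the answer further down.

```python
from difflib import SequenceMatcher

def field_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1], compared case-insensitively.
    difflib stands in for Levenshtein.ratio here (an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def row_similarity(row_a: dict, row_b: dict, fields) -> float:
    """Unweighted mean of the per-field similarities."""
    scores = [field_ratio(row_a[f], row_b[f]) for f in fields]
    return sum(scores) / len(scores)

FIELDS = ("name", "address", "phone")

row1 = {"name": "Game of ThOnes Books for selling",
        "address": "George Washington street",
        "phone": "555-55-55"}
row2 = {"name": "Selling Game of Thrones books",
        "address": "George Washington st.",
        "phone": "0(555)-55-55"}
# Clearly unrelated row, taken from the answer's sample data
row3 = {"name": "Harry Potter BlueRay",
        "address": "Central Avenue",
        "phone": "111-222-333"}

sim_12 = row_similarity(row1, row2, FIELDS)
sim_13 = row_similarity(row1, row3, FIELDS)
print(f"row1 vs row2: {sim_12:.3f}")
print(f"row1 vs row3: {sim_13:.3f}")
```

Comparing the score of the suspected duplicate pair against the score of an unrelated pair gives a better sense of scale than looking at a single average in isolation.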

My question is: what is the best way to compare two rows of data? Which algorithms or methods are needed to do this?

Best answer

We can compute a distance matrix between the rows, form clusters, and select the cluster members as candidates for similar rows.
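For readers following along in Python, the idea can be sketched with the standard library alone. This is not the method used in the answer's R code: the Levenshtein distance is hand-written, and clusters are formed by linking any pair of rows whose distance is at most `max_dist` (single-linkage with a fixed cutoff) rather than by cutting a hierarchical clustering tree. The function names and the cutoff of 6 edits are assumptions chosen for this toy input.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def distance_matrix(strings):
    """Symmetric matrix of pairwise edit distances."""
    n = len(strings)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = levenshtein(strings[i], strings[j])
    return dist

def threshold_clusters(dist, max_dist):
    """Group row indices connected by pairs with distance <= max_dist."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

addresses = ["George Washington street", "George Washington st.", "Central Avenue"]
dist = distance_matrix(addresses)
clusters = threshold_clusters(dist, max_dist=6)  # cutoff is an assumption
candidates = [c for c in clusters if len(c) > 1]
print(candidates)  # 0-based row indices of likely near-duplicates
```

A fixed cutoff on raw edit distance is sensitive to string length, so on real data a length-normalized distance may be a better fit.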

The stringdistmatrix function from the R package stringdist computes the pairwise distances between string inputs.

The distance methods supported by stringdist are listed below. See the package manual for more details.

#Method name;   Description
#osa    ; Optimal string alignment (restricted Damerau-Levenshtein distance).
#lv ; Levenshtein distance (as in R's native adist).
#dl ; Full Damerau-Levenshtein distance.
#hamming    ; Hamming distance (a and b must have the same number of characters).
#lcs    ; Longest common substring distance.
#qgram  ; q-gram distance.
#cosine ; Cosine distance between q-gram profiles.
#jaccard    ; Jaccard distance between q-gram profiles.
#jw ; Jaro or Jaro-Winkler distance.
#soundex    ; Distance based on soundex encoding.
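To make a few of these definitions concrete, here is a rough Python sketch of the Hamming, q-gram, Jaccard, and cosine distances. These are textbook versions written for illustration, not the stringdist implementations, so edge-case behavior may differ from the R package. The function names and the choice of `q=2` are assumptions.

```python
from collections import Counter
from math import sqrt

def qgrams(s: str, q: int = 2) -> Counter:
    """Profile of overlapping substrings of length q with their counts."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the two q-gram count profiles."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())

def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus (shared q-grams / all q-grams), ignoring counts."""
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus the cosine of the angle between the q-gram count vectors."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    norm = (sqrt(sum(v * v for v in pa.values()))
            * sqrt(sum(v * v for v in pb.values())))
    if norm == 0:
        return 0.0 if pa == pb else 1.0
    dot = sum(count * pb[g] for g, count in pa.items())
    return 1.0 - dot / norm

a, b = "George Washington street", "George Washington st."
print("qgram  :", qgram_distance(a, b))
print("jaccard:", round(jaccard_distance(a, b), 3))
print("cosine :", round(cosine_distance(a, b), 3))
```

The q-gram based measures ignore word order to a large extent, which can help with inputs such as "Books for selling" versus "Selling ... books", where plain edit distance is penalized by the reordering.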

Data:

library("stringdist")

#have modified the data slightly to include dissimilar datapoints
Date = c("07-Jan-17","06-Feb-17","03-Mar-17")
name     =  c("Game of ThOnes Books for selling","Selling Game of Thrones books","Harry Potter BlueRay")
address  =  c("George Washington street","George Washington st.","Central Avenue")
phone    =  c("555-55-55","0(555)-55-55","111-222-333")
DF = data.frame(Date,name,address,phone,stringsAsFactors=FALSE)

DF
#       Date                             name                  address        phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
#2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
#3 03-Mar-17             Harry Potter BlueRay           Central Avenue  111-222-333

Hierarchical clustering:

rowLabels = sapply(DF[,"name"],function(x) paste0(head(unlist(strsplit(x," ")),2),collapse="_" ) )

#create string distance matrix, hierarchical cluster object and corresponding plot
nameDist = stringdistmatrix(DF[,"name"])
nameHC = hclust(nameDist)

plot(nameHC,labels = rowLabels ,main="HC plot : name")


addressDist = stringdistmatrix(DF[,"address"])
addressDistHC = hclust(addressDist)

plot(addressDistHC ,labels = rowLabels, main="HC plot : address")


phoneDist = stringdistmatrix(DF[,"phone"])
phoneHC = hclust(phoneDist)

plot(phoneHC ,labels = rowLabels, main="HC plot : phone" )


Similar rows:

In this dataset the rows consistently form two clusters. To identify the members of each cluster, we can do the following:

clusterDF = data.frame(sapply(DF[,-1],function(x) cutree(hclust(stringdistmatrix(x)),2) ))
clusterDF$rowSummary = rowSums(clusterDF)

clusterDF
#  name address phone rowSummary
#1    1       1     1          3
#2    1       1     1          3
#3    2       2     2          6


#row frequency

rowFreq  = table(clusterDF$rowSummary)
#3 6 
#2 1

#we filter rows with frequency > 1
similarRowValues  = as.numeric(names(which(rowFreq>1)))


DF[clusterDF$rowSummary %in% similarRowValues,]
#       Date                             name                  address        phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
#2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
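The selection step above can be mirrored in Python. The sketch below starts from the per-field cluster labels shown in the `clusterDF` output and groups rows by their full tuple of labels. Note that this is a swapped-in variant rather than the answer's exact method: summing labels can collide (for example, labels (1, 2, 2) and (2, 1, 2) both sum to 5), whereas tuples only match when every field agrees. The function name `rows_with_same_labels` is hypothetical, and row indices are 0-based, unlike R.

```python
from collections import defaultdict

# Per-field cluster labels, copied from the clusterDF output above
cluster_labels = {
    "name":    [1, 1, 2],
    "address": [1, 1, 2],
    "phone":   [1, 1, 2],
}

def rows_with_same_labels(labels_by_field):
    """Return groups of row indices that share a cluster in every field."""
    fields = list(labels_by_field)
    n_rows = len(labels_by_field[fields[0]])
    groups = defaultdict(list)
    for row in range(n_rows):
        # The tuple of labels acts as the row's cluster "signature"
        signature = tuple(labels_by_field[f][row] for f in fields)
        groups[signature].append(row)
    # Keep only signatures shared by more than one row
    return [rows for rows in groups.values() if len(rows) > 1]

similar = rows_with_same_labels(cluster_labels)
print(similar)  # [[0, 1]]
```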

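This demo works well on a simple toy dataset, but on a real dataset you will have to adjust the string distance method, the number of clusters, and so on. Still, I hope it points you in the right direction.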

Regarding "python - How to find the similarity between two rows of data", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42090680/
