mysql - Jaro-Winkler function: why is the same score matching very similar and very different words?

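I am using Jaro-Winkler fuzzy matching to match names.

I am trying to determine a cutoff range for the similarity score. If two names differ too much, I want to exclude them and send them for manual review.

Names scoring below .4 appear to be completely different, while names in the .4 range mostly look very similar.

But then I ran into odd exceptions: within that range, some names are completely different and others differ by only one or two letters (see the examples below).

Can someone explain why the matches within the same score range vary so widely?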

   Estrella       ANNELISE    0.42
   Arienna        IREANNA     0.43
   Tayvia         I TAYVIA    0.43
   Amanda         IZABEL      0.44
   Hunter         JOSHUA      0.44
   Ryder          CHARLES     0.45
   Luis           ELIZABETH   0.45
   Sebastian      JOSE        0.45
   Christopher    CHISTOPHE   0.46
   Genayunique    GENAY-UNI   0.46
   Andreeaonn     ADREEAONN   0.46
   Chistopher     CHRISTOPH   0.46
   Dazharicon     DAZHARION   0.46
   Jennavecia     JENNACVEC   0.46
   Valentiria     VALENTINA   0.46
   Abel           SAMMUEL     0.46
   Dezarea Marie  DEZAREA     0.47
   Alexander      ALEXZANDE   0.47

Best answer

The Jaro-Winkler distance formula favors strings that share a common beginning, for example Valentina and Valentiria.

It also has a few less intuitive "rules" (see Wikipedia).
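To make the prefix preference concrete, here is a minimal Python sketch of the textbook Jaro and Jaro-Winkler similarity as described on Wikipedia. It is not the MySQL function from the question, whose source is not shown. The function names, the scaling factor p = 0.1, the four-character prefix cap, and the absence of a boost threshold are all assumptions of this sketch.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Two equal characters only count as a match if their positions
    # are at most `window` apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) / 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


# The comparison is case-sensitive, exactly as written.
print(round(jaro_winkler("valentiria", "valentina"), 2))
print(round(jaro_winkler("Valentiria", "VALENTINA"), 2))
print(round(jaro_winkler("Abel", "SAMMUEL"), 2))
```

With this sketch the lower-cased pair scores about 0.94, while the mixed-case pairs "Valentiria"/"VALENTINA" and "Abel"/"SAMMUEL" both score about 0.46, the values shown in the question. In each mixed-case pair only a single capital letter matches, which would explain why near-identical and unrelated names land in the same range. This suggests, but does not prove, that the comparison in the question is case-sensitive; normalizing both columns to the same case before scoring would be a cheap thing to try.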

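You should probably first decide what kinds of differences you expect, and then look for a distance formula that suits them. In writing, for example, "angleworm" and "angelworm" are very likely to be mistaken for each other, so the distance between those two strings should be small. "there" and "three" are less likely to be confused, and "ether" even less so. For longer anagrams, the Jaro distance may well come out exactly the same, and even the Winkler correction may not kick in.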
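If swapped adjacent letters are the kind of error you expect, one candidate is an edit distance that counts such a swap as a single step. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance. This is a different measure from Jaro-Winkler, swapped in purely for illustration, and the function name is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and swaps of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "le" <-> "el".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for left, right in [("angleworm", "angelworm"),
                    ("there", "three"),
                    ("there", "ether")]:
    print(left, right, osa_distance(left, right))
```

Here a smaller number means more similar, the opposite of a Jaro-Winkler score, so any cutoff has to be chosen separately, and probably relative to the length of the names.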

As you can read on this page (emphasis mine):

Beyond the optimization for empty strings and those which are exactly the same, you can see here that I weight the first character even more heavily. This is due to my data being very initial heavy.

To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension initials would frequently align incorrectly.
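The 80/20 weighting described in the quote can be sketched as follows. The function name and signature are hypothetical, the Jaro-Winkler score is passed in rather than computed here, and comparing the first characters case-insensitively is an assumption of this sketch, not something the quoted author states.

```python
def blended_score(name_a: str, name_b: str, jw_score: float) -> float:
    """Weight a precomputed Jaro-Winkler score at 80% and give the
    remaining 20% only when the first characters agree."""
    # Assumption: the first characters are compared case-insensitively.
    first_matches = (
        bool(name_a) and bool(name_b)
        and name_a[0].casefold() == name_b[0].casefold()
    )
    return 0.8 * jw_score + (0.2 if first_matches else 0.0)


# Same underlying score, different first letters.
print(blended_score("Abel", "SAMMUEL", 0.5))
print(blended_score("Amy", "ann", 0.5))
```

With the same underlying score, the pair that shares an initial ends up 0.2 higher, which is the effect the quoted author wanted for data that is heavy on initials.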

The original question, "mysql - Jaro-Winkler function: why is the same score matching very similar and very different words?", can be found on Stack Overflow: https://stackoverflow.com/questions/48406993/
