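I want to find all duplicate names in my contacts table, where names count as duplicates if they sound alike. For example: Rita or Reeta, Microsoft or Microsift, Mukherjee or Mukherji.
I used the following query: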
SELECT contacts.id
FROM contacts
INNER JOIN (
SELECT first_name, last_name, count(*) AS rows
FROM contacts
WHERE deleted = 0
GROUP BY SOUNDEX(first_name), SOUNDEX(last_name)
HAVING count(rows) > 1
) AS p
WHERE contacts.deleted = 0
AND p.first_name SOUNDS LIKE contacts.first_name
AND p.last_name SOUNDS LIKE contacts.last_name
ORDER BY contacts.date_entered DESC
The query above returns the correct results, but it takes a long time when the table has many records.
Best answer
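I don't know of a better (native) way than SOUNDEX(). It is slow because it is a function: the server has to process every record to compute the value before it can do any matching, so no index on the name columns helps. The way around this is to store the result directly in the table. I have no experience with these functions in MySQL, but judging by the documentation, it looks like you can rewrite the WHERE clause as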
[...] AND SOUNDEX(p.first_name) = SOUNDEX(contacts.first_name) [...]
So if you have precomputed these values (and indexed them!), looking up matching records should be much faster.
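As a quick sanity check of that equivalence, here is a minimal sketch to run in a MySQL client, using two of the names from the question. The column aliases are only illustrative.

```sql
-- SOUNDS LIKE is documented as shorthand for comparing SOUNDEX() values,
-- so the first two columns should always agree with each other.
SELECT 'Rita' SOUNDS LIKE 'Reeta'        AS via_sounds_like,
       SOUNDEX('Rita') = SOUNDEX('Reeta') AS via_soundex,
       SOUNDEX('Rita')                    AS code_rita,
       SOUNDEX('Reeta')                   AS code_reeta;
```

Looking at the raw codes this way is also a cheap way to see which of your real names SOUNDEX() does or does not consider alike before relying on it.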
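That said, I have a hard time figuring out your query. I don't think you need the HAVING COUNT(*) > 1 in there, and even so, I'm not sure how you want to group and filter the contacts.
Are you looking for something like this: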
SELECT c1.id as contact_id,
c2.id as similar_id
FROM contacts c1
JOIN contacts c2
ON c2.id <> c1.id
AND c2.deleted = 0
AND SOUNDEX(c2.first_name) = SOUNDEX(c1.first_name)
AND SOUNDEX(c2.last_name) = SOUNDEX(c1.last_name)
WHERE c1.deleted = 0
ORDER BY c1.date_entered DESC
You can then optimize it as suggested above:
SELECT c1.id as contact_id,
c2.id as similar_id
FROM contacts c1
JOIN contacts c2
ON c2.id <> c1.id
AND c2.deleted = 0
AND c2.first_name_soundex = c1.first_name_soundex
AND c2.last_name_soundex = c1.last_name_soundex
WHERE c1.deleted = 0
ORDER BY c1.date_entered DESC
where first_name_soundex holds the result of SOUNDEX(first_name), and likewise last_name_soundex holds SOUNDEX(last_name).
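One possible way to create and maintain those columns is sketched below. This is an untested sketch, not something from the original question: the VARCHAR(64) length and the trigger names are assumptions, and MySQL's SOUNDEX() can return codes longer than four characters, so the length should be sized against your actual name columns.

```sql
-- Assumed column definitions; the names match the optimized query above.
ALTER TABLE contacts
  ADD COLUMN first_name_soundex VARCHAR(64) NULL,
  ADD COLUMN last_name_soundex  VARCHAR(64) NULL;

-- One-off backfill for rows that already exist.
UPDATE contacts
SET first_name_soundex = SOUNDEX(first_name),
    last_name_soundex  = SOUNDEX(last_name);

-- Keep the stored codes in sync on insert and update
-- (trigger names are hypothetical).
CREATE TRIGGER contacts_soundex_bi BEFORE INSERT ON contacts
FOR EACH ROW
  SET NEW.first_name_soundex = SOUNDEX(NEW.first_name),
      NEW.last_name_soundex  = SOUNDEX(NEW.last_name);

CREATE TRIGGER contacts_soundex_bu BEFORE UPDATE ON contacts
FOR EACH ROW
  SET NEW.first_name_soundex = SOUNDEX(NEW.first_name),
      NEW.last_name_soundex  = SOUNDEX(NEW.last_name);
```

If triggers are not an option, the application can set the two columns itself whenever it writes a name. The important part is that the stored code never drifts from the name it was computed from.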
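When indexing, you probably want a covering index on deleted, first_name_soundex, last_name_soundex.
(As far as I know, MySQL does not support filtered indexes yet; otherwise you could restrict the index to rows with deleted = 0.)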
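A minimal sketch of such an index follows, assuming the first_name_soundex and last_name_soundex columns already exist, that the table uses InnoDB, and that id is its primary key. The index name is made up.

```sql
-- deleted comes first so the equality filter on it narrows the range,
-- followed by the two codes the self-join compares.
CREATE INDEX idx_contacts_soundex
    ON contacts (deleted, first_name_soundex, last_name_soundex);

-- Check that the optimizer actually picks the index for the join.
EXPLAIN
SELECT c1.id AS contact_id,
       c2.id AS similar_id
FROM contacts c1
JOIN contacts c2
  ON c2.id <> c1.id
 AND c2.deleted = 0
 AND c2.first_name_soundex = c1.first_name_soundex
 AND c2.last_name_soundex = c1.last_name_soundex
WHERE c1.deleted = 0;
```

Under those assumptions the primary key is stored in every secondary index entry, so id is available from the index without listing it explicitly. date_entered is not part of this index, so the ORDER BY in the full query still has to read it from the table rows.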
Source: the Stack Overflow question "mysql - Get records with similar sounds": https://stackoverflow.com/questions/22983065/