php - Get the most repeated similar fields in a MySQL database

Suppose we have a database like this:

Actions_tbl:

--------------------------------------------------------
id | Action_name                              | user_id|
--------------------------------------------------------
1  |  John reads one book                     | 1     
2  |  reading the book by john                | 1
3  |  Joe is jumping over fire                | 2
4  |  reading another book                    | 2
5  |  John reads the book in library          | 1
6  |  Joe read a    book                      | 2
7  |  read a book                             | 3
8  |  jumping with no reason is Ronald's habit| 3

Users_tbl:

-----------------------
user_id |    user_name |
-----------------------
1       |     John
2       |     Joe
3       |     Ronald
4       |     Araz
-----------------------

I am wondering whether I can pick the most repeated similar action, regardless of its user, and replace the user_name in it with the current user's name.

"Read one book", "reading the book", "reading another book", "read the book in library", "read a book" and "read a book" are the ones that share the most common WORDS, so the stuff related to reading the book is repeated 6 times. My system should show one of those six sentences at random, with the user_name replaced by Araz.

Like: Araz reads the book

My idea was to run

SELECT REPLACE(a.Action_name, b.user_name, 'Araz') FROM Actions_tbl a JOIN Users_tbl b ON a.user_id = b.user_id

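and then check the similarities one by one in PHP using levenshtein(), but this performs terribly!

Suppose I want to do the same thing on a big database and across several different tables. This would kill my server!

Any better ideas?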

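At http://www.artfulsoftware.com/infotree/queries.php#552 the levenshtein() function is implemented as a MySQL function, but first, do you think it performs well enough? And then, how would I use it in my case? Maybe a self-join can solve this problem, but I am not that good at SQL!

* A similar action is an action that contains more than X% common words.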

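**More information and notes:**

I am limited to PHP and MySQL.
This is just an example; in my real project the actions are long paragraphs, which is why performance is an issue. The real scenario is: a user has entered descriptions for several of their projects, and these descriptions may be very similar (a user will tend to work in the same field). I want to auto-fill the description of the next project, based on the previous entries, to save time.
I would appreciate any practical solution. I have looked at NLP-related solutions, and although they are interesting, I don't think many of them are both accurate and implementable in PHP.
The output should make sense and be a proper paragraph like all the other entries. That is why I am thinking of choosing from the previous ones.

Thanks for your thoughtful answers. I would really appreciate it if you could shed some light on the situation.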

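Best answer

What you are describing is a text clustering process. You are trying to find similar pieces of text and arbitrarily choose one of them. I am not familiar with any database that does this form of text mining.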

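Based on your description, a very basic text mining technique may work. Create a term-document matrix using all the words except the user names. Then use singular value decomposition to get the largest singular value and its vectors (this is the first principal component of the correlation matrix). Similar actions should cluster along this line.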
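As a rough sketch of this idea in plain PHP, the snippet below builds a raw word-count matrix from the question's sample actions and approximates the leading singular vector with power iteration. Power iteration is swapped in here for a full SVD routine, on the assumption that no linear-algebra library is available; the counts are not centered or scaled, and the names `termDocumentMatrix` and `leadingSingularVector` are made up for this illustration.

```php
<?php
// Sample actions copied from the question; user names are removed before counting.
$actions = [
    'John reads one book',
    'reading the book by john',
    'Joe is jumping over fire',
    'reading another book',
    'John reads the book in library',
    'Joe read a    book',
    'read a book',
    "jumping with no reason is Ronald's habit",
];
$userNames = ['john', 'joe', 'ronald', 'araz'];

// Build a count matrix with one row per action and one column per distinct word.
function termDocumentMatrix(array $actions, array $userNames): array
{
    $vocab = [];
    $docs = [];
    foreach ($actions as $text) {
        $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $words = array_diff($words, $userNames);
        foreach ($words as $w) {
            if (!isset($vocab[$w])) {
                $vocab[$w] = count($vocab); // next free column index
            }
        }
        $docs[] = $words;
    }
    $matrix = [];
    foreach ($docs as $words) {
        $row = array_fill(0, count($vocab), 0.0);
        foreach ($words as $w) {
            $row[$vocab[$w]] += 1.0;
        }
        $matrix[] = $row;
    }
    return $matrix;
}

// Power iteration on A * A^T: returns one score per action (unit length).
function leadingSingularVector(array $matrix, int $iterations = 100): array
{
    $nDocs = count($matrix);
    $nTerms = count($matrix[0]);
    $u = array_fill(0, $nDocs, 1.0);
    for ($it = 0; $it < $iterations; $it++) {
        // v = A^T u (one weight per word)
        $v = array_fill(0, $nTerms, 0.0);
        for ($d = 0; $d < $nDocs; $d++) {
            for ($t = 0; $t < $nTerms; $t++) {
                $v[$t] += $matrix[$d][$t] * $u[$d];
            }
        }
        // u = A v, then normalise
        $norm = 0.0;
        for ($d = 0; $d < $nDocs; $d++) {
            $sum = 0.0;
            for ($t = 0; $t < $nTerms; $t++) {
                $sum += $matrix[$d][$t] * $v[$t];
            }
            $u[$d] = $sum;
            $norm += $sum * $sum;
        }
        $norm = sqrt($norm);
        if ($norm == 0.0) {
            break; // empty matrix: nothing to rank
        }
        for ($d = 0; $d < $nDocs; $d++) {
            $u[$d] /= $norm;
        }
    }
    return $u;
}

$scores = leadingSingularVector(termDocumentMatrix($actions, $userNames));
foreach ($actions as $i => $text) {
    printf("%.3f  %s\n", $scores[$i], $text);
}
```

On this sample, the actions about reading a book should end up with clearly positive scores while the two jumping actions fall toward zero, so a simple cut-off on the score separates the dominant group. For long paragraphs you would probably want stop-word removal and some term weighting first; that is left out here.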

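If your vocabulary is limited and you have the terms in a table, you can measure the distance between two actions by the proportion of overlapping words. Do you have a list of all the words in the actions?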
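A minimal PHP sketch of the overlap measure might look like the following. It assumes the Jaccard index (shared words divided by all distinct words) as the "proportion of overlapping words", a threshold of 0.3 standing in for the X% from the question, and a very crude stemmer that only strips a trailing "ing" or "s"; all function names are hypothetical. The group is simply the action with the most neighbours above the threshold, not a full clustering.

```php
<?php
// Rows as they would come from Actions_tbl joined with Users_tbl: [action, owner].
$rows = [
    ['John reads one book', 'John'],
    ['reading the book by john', 'John'],
    ['Joe is jumping over fire', 'Joe'],
    ['reading another book', 'Joe'],
    ['John reads the book in library', 'John'],
    ['Joe read a    book', 'Joe'],
    ['read a book', 'Ronald'],
    ["jumping with no reason is Ronald's habit", 'Ronald'],
];

// Lowercase, drop the owner's name, and crudely stem each word.
// Stripping "ing"/"s" is an assumption for this demo, not a real stemmer.
function wordSet(string $text, string $owner): array
{
    $text = str_replace("'s", '', strtolower($text));
    $words = preg_split('/[^a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $set = [];
    foreach ($words as $w) {
        if ($w === strtolower($owner)) {
            continue;
        }
        if (strlen($w) > 3) {
            $w = preg_replace('/(ing|s)$/', '', $w);
        }
        $set[$w] = true;
    }
    return array_keys($set);
}

// Proportion of overlapping words: shared words / all distinct words.
function wordOverlap(array $a, array $b): float
{
    $union = count(array_unique(array_merge($a, $b)));
    return $union === 0 ? 0.0 : count(array_intersect($a, $b)) / $union;
}

// Indexes of the largest group: the action with the most neighbours
// at or above $threshold, together with those neighbours.
function largestGroup(array $sets, float $threshold): array
{
    $best = [];
    foreach ($sets as $i => $a) {
        $group = [];
        foreach ($sets as $j => $b) {
            if ($i === $j || wordOverlap($a, $b) >= $threshold) {
                $group[] = $j;
            }
        }
        if (count($group) > count($best)) {
            $best = $group;
        }
    }
    return $best;
}

$sets = [];
foreach ($rows as [$text, $owner]) {
    $sets[] = wordSet($text, $owner);
}
$group = largestGroup($sets, 0.3); // 0.3 plays the role of "X%"

// Pick one member at random and swap the owner's name for the current user.
[$text, $owner] = $rows[$group[array_rand($group)]];
echo preg_replace('/\b' . preg_quote($owner, '/') . '\b/i', 'Araz', $text), "\n";
```

With the sample data the largest group is the six book-related actions, which matches the count in the question. Note that this compares every pair of actions, so the cost grows quadratically with the number of rows; on a large table you would likely store the word sets once and only compare against a limited set of candidates.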

The original question can be found on Stack Overflow: https://stackoverflow.com/questions/11538409/
