Background
I have a product_visits table that looks like this:
create table product_visits (product_id int, visitor_id int);
insert into product_visits values
(1, 1),
(1, 2),
(1, 3),
(1, 4),
(1, 5),
(2, 1),
(2, 2),
(2, 3),
(2, 4),
(2, 5),
(3, 1),
(3, 2),
(3, 3),
(4, 1),
(4, 2),
(5, 1);
Or, as a table:
| product_id | visitor_id |
|------------|------------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
| 2 | 5 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 4 | 1 |
| 4 | 2 |
| 5 | 1 |
I am currently using the following SQL to select the top 2 other products that visitors of a given product have also visited:
SELECT a.`product_id`, count(a.`product_id`) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` = ?
AND a.`product_id` != ?
GROUP BY a.`product_id`
ORDER BY visits DESC
LIMIT 2
For example, running it with product_id = 1 against the data above gives the following result:
| product_id | visits |
|------------|--------|
| 2 | 5 |
| 3 | 3 |
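This works fine when fetching results for one product at a time.
Problem
What I need is to rewrite the query above so that a single query covers every product in the product_visits table. I still need to limit the results to the top n (e.g. 2) per product. For example, with the data above, I would expect to see the following: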
| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
| 1 | 2 | 5 |
| 1 | 3 | 3 |
| 2 | 1 | 5 |
| 2 | 3 | 3 |
| 3 | 1 | 3 |
| 3 | 2 | 3 |
| 4 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 1 | 1 |
| 5 | 2 | 1 |
The closest I have come to this is the following query:
SELECT a.`product_id` target_product_id, b.`product_id` related_product_id, count(*) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` != a.`product_id`
GROUP BY a.`product_id`, b.`product_id`
ORDER BY target_product_id ASC, visits DESC
This gives me the following results, but it still does not limit the output to the top n matches for each target_product_id:
| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
| 1 | 2 | 5 |
| 1 | 3 | 3 |
| 1 | 4 | 2 |
| 1 | 5 | 1 |
| 2 | 1 | 5 |
| 2 | 3 | 3 |
| 2 | 4 | 2 |
| 2 | 5 | 1 |
| 3 | 1 | 3 |
| 3 | 2 | 3 |
| 3 | 4 | 2 |
| 3 | 5 | 1 |
| 4 | 3 | 2 |
| 4 | 1 | 2 |
| 4 | 2 | 2 |
| 4 | 5 | 1 |
| 5 | 3 | 1 |
| 5 | 1 | 1 |
| 5 | 4 | 1 |
| 5 | 2 | 1 |
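I have been racking my brain over this for a while but have not been able to come up with a complete solution.
Update #1
I ran Gordon Linoff's suggested SQL (below) against my production data, in a development database of course. My product_visits table has roughly 2.6 million records. With the limit set to 2, the query took 41.8572 seconds to run. Almost all of that time (40.4 seconds) was spent in the "Copying to tmp table" state.
Running that SQL through EXPLAIN produces the following output: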
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|-------|------|---------------|-----|---------|-----|------|-------|
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1161898 | Using where; Using filesort |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 1161898 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | a | index | PRIMARY,ndx_user | ndx_product | 24 | NULL | 2603025 | Using index; Using temporary; Using filesort |
| 3 | DERIVED | b | ref | PRIMARY,ndx_user | PRIMARY | 116 | product_visits.a.user | 1 | Using where; Using index |
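While that SQL does almost exactly what I want, its performance is killing me. Any ideas on how to speed it up?
Best answer
I think the simplest approach in MySQL is to use variables: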
SELECT tr.*
FROM (SELECT tr.*,
             -- rn restarts at 1 whenever target_product_id changes, otherwise it increments
             (@rn := if(@p = target_product_id, @rn + 1,
                        if(@p := target_product_id, 1, 1)
                       )
             ) as rn
      -- inner query: co-visit counts per (target, related) pair,
      -- sorted so each target's rows arrive with the highest counts first
      FROM (SELECT a.`product_id` as target_product_id, b.`product_id` as related_product_id,
                   count(*) visits
            FROM `product_visits` a INNER JOIN
                 `product_visits` b
                 ON a.`visitor_id` = b.`visitor_id` AND
                    b.`product_id` != a.`product_id`
            GROUP BY a.`product_id`, b.`product_id`
            ORDER BY a.`product_id`, COUNT(*) desc
           ) tr CROSS JOIN
           (SELECT @p := -1, @rn := 0) params
     ) tr
WHERE rn <= 2
ORDER BY target_product_id ASC, visits DESC;
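For readers on MySQL 8.0 or later, the same per-group limit can also be written with ROW_NUMBER(). This is only a sketch under that version assumption and is not part of the original answer: the CTE names pair_counts and ranked are illustrative, and the tie-break on related_product_id is an added assumption so that products with equal counts come out in a deterministic order. It may be worth considering because MySQL 8.0 deprecates assigning user variables inside expressions, which the variable-based query relies on. How it performs on a table of about 2.6 million rows has not been measured here.

```sql
-- Assumes MySQL 8.0+ (CTEs and window functions are not available in 5.x).
WITH pair_counts AS (
    -- Co-visit count for every (target, related) product pair.
    SELECT a.product_id AS target_product_id,
           b.product_id AS related_product_id,
           COUNT(*)     AS visits
    FROM product_visits a
    INNER JOIN product_visits b
            ON a.visitor_id = b.visitor_id
           AND b.product_id <> a.product_id
    GROUP BY a.product_id, b.product_id
),
ranked AS (
    -- Number the related products within each target, highest count first.
    -- The secondary sort key is an assumption added to break ties predictably.
    SELECT pc.*,
           ROW_NUMBER() OVER (
               PARTITION BY target_product_id
               ORDER BY visits DESC, related_product_id ASC
           ) AS rn
    FROM pair_counts pc
)
SELECT target_product_id, related_product_id, visits
FROM ranked
WHERE rn <= 2   -- n = 2; change this to keep more rows per product
ORDER BY target_product_id ASC, visits DESC, related_product_id ASC;
```

Against the sample data in the question, this returns the ten rows shown in the desired result table. With ties broken by the lower related_product_id, products 4 and 5 each keep products 1 and 2.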
Regarding mysql - finding the top N products viewed by others (in MySQL), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36072020/