MySql item-by-item recommendation performance

I'm seeing a performance problem with a MySql query that I can't explain.

The data is a MySql InnoDB table with 3.85 million rows of item-to-item correlation data.

For item "item_i", another item "also_i" was ordered by "count_i" people.

CREATE TABLE `hl_also2sm` (
  `item_i` int(10) unsigned NOT NULL DEFAULT '0',
  `also_i` int(10) unsigned NOT NULL DEFAULT '0',
  `count_i` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`item_i`,`also_i`),
  KEY `count_i` (`count_i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

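A sample correlation is run by taking a list of items, finding the correlated items, and returning the approximate time the MySql query took to run.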

// Javascript in NodeJS with MySql, on Debian Linux
var util = require('util'); // needed for util.promisify below
var sql = require('./routes/sqlpool'); // connects to DB
var cmd = util.promisify(sql.cmd); // Promise of raw MySql command function

async function inquiry(NumberOfItems){
  // generate random list of items to perform correlation against
  var rtn = await cmd(`select DISTINCT item_i from hl_also2sm order by RAND() limit ${NumberOfItems}`);
  var items_l = rtn.map((h)=>{return h.item_i});

  var ts = Date.now();

  // get top 50 correlated items
  var c = `select also_i,COUNT(*) as cnt,SUM(count_i) as sum from hl_also2sm 
    where item_i IN (${items_l.join(",")}) 
    AND also_i NOT IN (${items_l.join(",")}) 
    group by also_i 
    order by cnt DESC,sum DESC limit 50`;
  await cmd(c);

  var MilliSeconds = Date.now()-ts;
  return MilliSeconds;
};
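
The query above interpolates the id list straight into the SQL text. As a minimal sketch of an alternative, the same statement can be built with "?" placeholders and a separate parameter array. This assumes a driver call that accepts (sql, values), as connection.query in the mysql npm package does; whether the sql.cmd wrapper used here forwards a values array is not known from the post. buildCorrelationQuery is a hypothetical helper name.

```javascript
// Hypothetical helper (not from the original post): builds the correlation
// query with "?" placeholders instead of interpolating ids into the SQL text.
function buildCorrelationQuery(itemIds, limit = 50) {
  if (!Array.isArray(itemIds) || itemIds.length === 0) {
    throw new Error('itemIds must be a non-empty array');
  }
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error('limit must be a positive integer');
  }
  // One "?" per id; the same list is used for IN and NOT IN.
  const marks = itemIds.map(() => '?').join(',');
  const sql =
    'SELECT also_i, COUNT(*) AS cnt, SUM(count_i) AS sum ' +
    'FROM hl_also2sm ' +
    `WHERE item_i IN (${marks}) AND also_i NOT IN (${marks}) ` +
    'GROUP BY also_i ' +
    'ORDER BY cnt DESC, sum DESC ' +
    `LIMIT ${limit}`; // validated integer, safe to inline
  // Parameter order must match placeholder order: IN list, then NOT IN list.
  const params = [...itemIds, ...itemIds];
  return { sql, params };
}

const q = buildCorrelationQuery([11, 22, 33]);
console.log(q.sql);
console.log(q.params);
```

The ids appear twice in params because the list feeds both the IN and the NOT IN clause.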

Testing over a range of item counts

async function inquiries(){
 for (var items=200;items<3000;items+=200) {
   var Data = [];
   for (var i=0;i<10;i++) {
     Data.push(await inquiry(items));
   }
   Data.sort((a,b)=>a-b); // numeric sort; the default sort() compares as strings
   console.log(`${items} items - min:${Data[0]} max:${Data[9]}`);
 }
}

The results are

200 items - min:315 max:331
400 items - min:1214 max:1235
600 items - min:2669 max:2718
800 items - min:4796 max:4823
1000 items - min:6872 max:7006
1200 items - min:134 max:154
1400 items - min:147 max:169
1600 items - min:162 max:198
1800 items - min:190 max:212
2000 items - min:210 max:244
2200 items - min:237 max:258
2400 items - min:248 max:293
2600 items - min:263 max:302
2800 items - min:292 max:322

This is very puzzling.

Why are 2000 items more than 25 times faster than 1000 items?

The EXPLAIN for the 1000-item select is

| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                        |
|  1 | SIMPLE      | hl_also2sm | index | PRIMARY       | count_i | 4       | NULL | 4043135 | Using where; Using index; Using temporary; Using filesort |

The EXPLAIN for the 2000-item select is

| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                        |
|  1 | SIMPLE      | hl_also2sm | range | PRIMARY       | PRIMARY | 4       | NULL | 758326 | Using where; Using temporary; Using filesort |
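
To make the difference between the two plans easier to see, here is a small sketch that condenses each EXPLAIN row to the three fields that changed. The values are copied from the outputs above; summarizePlan is a hypothetical helper. In general, type=index means a scan of the whole named index (here count_i, which in InnoDB also carries the primary key columns, hence "Using index"), while type=range means only the matching key ranges of PRIMARY are read.

```javascript
// Hypothetical helper: condenses one EXPLAIN row into the fields that differ
// between the two plans shown above.
function summarizePlan(row) {
  return `type=${row.type} key=${row.key} rows=${row.rows}`;
}

// Values copied from the two EXPLAIN outputs above.
const plan1000 = { type: 'index', key: 'count_i', rows: 4043135 };
const plan2000 = { type: 'range', key: 'PRIMARY', rows: 758326 };

console.log(summarizePlan(plan1000));
console.log(summarizePlan(plan2000));

// Ratio of the optimizer's row estimates for the two plans.
const ratio = plan1000.rows / plan2000.rows;
console.log(ratio.toFixed(1));
```

The row counts are optimizer estimates, so the ratio only indicates roughly how much more data the 1000-item plan expects to read.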

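I have run this many times, with similar results each time.

Yes, many of my users have shown interest in thousands of items through page views, reviews, viewing pictures, or ordering. I want to produce a good "you might also like" for them.

Summary of the question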

select  also_i,
        COUNT(*) as cnt,
        SUM(count_i) as sum
    from  hl_also2sm
    where  item_i     IN (...)   -- Varying the number of IN items
      AND  also_i NOT IN (...)   -- Varying the number of IN items
    group by  also_i
    order by  cnt DESC, sum DESC
    limit  50

With <= 1K items in the IN list, the query uses KEY(count_i) and runs slowly.
With more than 1K items in the IN list, the query does a table scan and runs faster.
Why??

Best answer

Changing

PRIMARY KEY (`item_i`,`also_i`)

to

KEY (`item_i`)
KEY (`also_i`)

seems to solve the problem.

CREATE TABLE `hl_also2sm` (
  `item_i` int(10) unsigned NOT NULL DEFAULT '0',
  `also_i` int(10) unsigned NOT NULL DEFAULT '0',
  `count_i` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `count_i` (`count_i`),
  KEY `item_i` (`item_i`),
  KEY `also_i` (`also_i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
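
For an existing table, the same change could be applied in place rather than by recreating the table. The following is a minimal sketch that only builds the ALTER TABLE statement; buildIndexMigration is a hypothetical helper, and running the statement is left to whatever query function is in use. Note that the resulting table has no primary key, so nothing enforces uniqueness of (item_i, also_i) pairs any more.

```javascript
// Hypothetical helper: returns one ALTER TABLE statement that drops the
// composite primary key and adds the two single-column keys from the answer.
function buildIndexMigration(table) {
  // Only allow plain identifiers, since the name is placed into the SQL text.
  if (!/^[A-Za-z0-9_]+$/.test(table)) {
    throw new Error('unexpected table name: ' + table);
  }
  return 'ALTER TABLE `' + table + '` ' +
    'DROP PRIMARY KEY, ' +
    'ADD KEY `item_i` (`item_i`), ' +
    'ADD KEY `also_i` (`also_i`)';
}

console.log(buildIndexMigration('hl_also2sm'));
```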

Performance

200 min:113 max:85
400 min:148 max:193
600 min:225 max:268
800 min:292 max:362
1000 min:333 max:450
1200 min:360 max:536
1400 min:521 max:618
1600 min:607 max:727
1800 min:698 max:789
2000 min:767 max:841
2200 min:765 max:952
2400 min:1000 max:987
2600 min:1011 max:1241
2800 min:1118 max:1186
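
A few rows in the timing tables show a min larger than the max (for example "200 min:113 max:85"). That is what one would see if the timings were sorted with JavaScript's default Array.prototype.sort(), which compares elements as strings. A minimal sketch of the effect and of a numeric alternative, where minMax is a hypothetical helper:

```javascript
// Hypothetical helper: returns the numeric minimum and maximum of a list of
// timings, using a comparator so values are compared as numbers.
function minMax(timings) {
  const sorted = [...timings].sort((a, b) => a - b);
  return { min: sorted[0], max: sorted[sorted.length - 1] };
}

// Default sort converts to strings: "113" sorts before "85" because "1" < "8".
const stringOrder = [85, 113].sort();
console.log(stringOrder); // [ 113, 85 ]

// Numeric comparison gives the expected result.
console.log(minMax([113, 85])); // { min: 85, max: 113 }
```

The measured values themselves are unaffected; only which of them gets reported as min and max changes.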

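This looks reasonable, but I would like it to be faster. Suggestions on restructuring this for better performance would be appreciated.

Trying "USE INDEX(PRIMARY)" to force use of the key was slower.
Removing the index on count_i was slower.

Changing to ENGINE=MEMORY, since this is a read-only table small enough to fit in memory (a 200MB in-memory table image on a 16GB machine), produces: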

200 min:16 max:23
400 min:28 max:38
600 min:46 max:56
800 min:58 max:69
1000 min:71 max:89
1200 min:100 max:99
1400 min:105 max:99
1600 min:116 max:132
1800 min:126 max:153
2000 min:139 max:165
2200 min:158 max:181
2400 min:171 max:194
2600 min:197 max:208
2800 min:203 max:223

This seems very reasonable for my purposes.
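
Two properties of the MEMORY engine matter for this approach: the rows are lost when the server restarts (only the table definition survives), and the table size is capped by max_heap_table_size, whose default is well below 200MB. One possible way to handle both is to keep the InnoDB table as the source and rebuild a MEMORY copy at startup. The sketch below only generates the statements; the target name hl_also2sm_mem and the 256 MiB limit are assumptions, and buildMemoryCopyStatements is a hypothetical helper.

```javascript
// Hypothetical helper: returns the statements needed to (re)build a MEMORY
// copy of an on-disk table. All names and the size limit are assumptions.
function buildMemoryCopyStatements(source, target, heapBytes) {
  for (const name of [source, target]) {
    // Only allow plain identifiers, since names are placed into the SQL text.
    if (!/^[A-Za-z0-9_]+$/.test(name)) {
      throw new Error('unexpected table name: ' + name);
    }
  }
  if (!Number.isInteger(heapBytes) || heapBytes <= 0) {
    throw new Error('heapBytes must be a positive integer');
  }
  return [
    // Session-scoped, so all four statements must run on the same connection
    // (relevant when the application uses a connection pool).
    `SET SESSION max_heap_table_size = ${heapBytes}`,
    // Copy the column and index definitions, then switch the engine.
    `CREATE TABLE ${target} LIKE ${source}`,
    `ALTER TABLE ${target} ENGINE=MEMORY`,
    // Load the rows from the on-disk table.
    `INSERT INTO ${target} SELECT * FROM ${source}`,
  ];
}

const statements = buildMemoryCopyStatements(
  'hl_also2sm', 'hl_also2sm_mem', 256 * 1024 * 1024
);
console.log(statements.join(';\n'));
```

If the target table already exists after a restart, only the INSERT step would be needed; that case is not handled here.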

A similar question about MySql item-by-item recommendation performance can be found on Stack Overflow: https://stackoverflow.com/questions/56994089/
