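Since this is my first post, it seems I can only post 1 link, so I have listed the sites I am referring to at the bottom. In short, my goal is to make the database return results faster. I have tried to include as much relevant information as I could to help frame the questions at the bottom of the post.
Machine Info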
8 processors
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
cache size : 6144 KB
cpu cores : 4
top - 17:11:48 up 35 days, 22:22, 10 users, load average: 1.35, 4.89, 7.80
Tasks: 329 total, 1 running, 328 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 87.4%id, 12.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8173980k total, 5374348k used, 2799632k free, 30148k buffers
Swap: 16777208k total, 6385312k used, 10391896k free, 2615836k cached
However, we are looking at moving the MySQL installation to a different machine in the cluster with 256 GB of RAM.
Table Info
My MySQL table looks like
CREATE TABLE ClusterMatches
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cluster_index INT,
matches LONGTEXT,
tfidf FLOAT,
INDEX(cluster_index)
);
It has about 18M rows, with 1M unique cluster_index values and 6K unique matches. The SQL query I am generating in PHP looks like
SQL Query
$sql_query="SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters."))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";
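where $clusters contains a string of about 3,000 comma-separated cluster_index values. This query touches about 50,000 rows and takes about 15 seconds to run; when the same query is run again, it takes about 1 second.
Usage
Subquery
Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time, I believe my subquery can be indexed.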
mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 48528 | Using temporary; Using filesort |
| 2 | DERIVED | ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 53689 | Using where |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
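According to this older article, [Optimizing MySQL: Queries and Indexes][2], the bad values to see in the Extra column here are "Using temporary" and "Using filesort".
MySQL Configuration Info
The query cache is available, but it is effectively turned off because its size is currently set to zero.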
mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name | Value |
+---------------------------------+----------------------+
| bdb_cache_size | 8384512 |
| binlog_cache_size | 32768 |
| expire_logs_days | 0 |
| have_query_cache | YES |
| flush | OFF |
| flush_time | 0 |
| innodb_additional_mem_pool_size | 1048576 |
| innodb_autoextend_increment | 8 |
| innodb_buffer_pool_awe_mem_mb | 0 |
| innodb_buffer_pool_size | 8388608 |
| join_buffer_size | 131072 |
| key_buffer_size | 8384512 |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 18446744073709547520 |
| sort_buffer_size | 2097144 |
| table_cache | 64 |
| thread_cache_size | 0 |
| query_cache_limit | 1048576 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 0 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| read_rnd_buffer_size | 262144 |
+---------------------------------+----------------------+
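Because `query_cache_size` is 0 in the output above, repeated identical queries are not being answered from the query cache, so the drop from 15 seconds to 1 second on the second run most likely comes from other caching (key buffer or OS file cache). A minimal sketch of inspecting and enabling the query cache at runtime follows. It assumes a MySQL version that still ships the query cache (it was removed in MySQL 8.0) and an account allowed to set global variables; the 64 MB size is an illustrative value, not a recommendation.

```sql
-- Show the current query cache settings.
SHOW VARIABLES LIKE 'query_cache%';

-- Allocate 64 MB to the query cache (illustrative size; tune to the workload).
SET GLOBAL query_cache_size = 67108864;

-- After running the query a few times, check hits, inserts and free memory.
SHOW STATUS LIKE 'Qcache%';
```

The query cache only returns a stored result when the statement text is identical, so it would help repeated requests with the same cluster list but not new combinations of cluster_index values.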
Based on this article on [Mysql Database Performance Turning][3], I believe the values I need to tweak are
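Identifying areas for improvement - MySQL query tweaking
matches
Based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]
Tools
For performance tuning I plan to use
Future Database Size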
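The goal is to build a system that can hold 1M unique cluster_index values, 1M unique match values, and approximately 3,000,000,000 table rows, with a query response time of around 0.5 seconds (we can add more RAM as necessary and distribute the database across the cluster).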
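To judge how much RAM a 3,000,000,000-row table would need, one rough approach is to read the current average row length from `information_schema` and scale it up. This is only a sketch: it assumes the table lives in the currently selected database, that the average row length stays roughly the same as the table grows, and it ignores index size, which has to be added on top.

```sql
-- Current size of the table and a naive projection of the data size
-- (in GiB) at 3,000,000,000 rows, based on the average row length today.
SELECT table_rows,
       avg_row_length,
       data_length,
       index_length,
       ROUND(avg_row_length * 3000000000 / POW(1024, 3), 1) AS projected_data_gib
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'ClusterMatches';
```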
Questions
Links
Best answer
Changing the table
Based on the advice in this post on How to pick indexes for order by and group by queries, the table now looks like
CREATE TABLE ClusterMatches
(
cluster_index INT UNSIGNED,
match_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup
(
match_index INT UNSIGNED NOT NULL PRIMARY KEY,
image_match TINYTEXT
);
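With the match text moved into MatchLookup, a final result needs a join back to that table to recover the text. One way this could look, as a sketch rather than the definitive query: aggregate and limit on the integer-only ClusterMatches table first, then join only the surviving rows. The aliases `t` and `ml` are illustrative, and the IN list is shortened to three values.

```sql
SELECT ml.image_match, t.total
FROM (
    -- Aggregate on the narrow table and keep just the top 10 groups.
    SELECT match_index, SUM(tfidf) AS total
    FROM ClusterMatches
    WHERE cluster_index IN (1, 2, 3)
    GROUP BY match_index
    ORDER BY total DESC
    LIMIT 10
) AS t
JOIN MatchLookup AS ml ON ml.match_index = t.match_index
ORDER BY t.total DESC;
```

Joining after the LIMIT means the text column is read for at most 10 rows instead of being carried through the grouping and sorting.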
Eliminating the subquery
The query without sorting the results by SUM(tfidf) looks like
SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
This eliminates "Using temporary" and "Using filesort"
explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 14938 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
The sorting problem
However, if I add ORDER BY SUM(tfidf)
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total |
+-------------+--------------------+
| 868 | 0.11126546561718 |
| 4182 | 0.0238558370620012 |
| 2162 | 0.0216601379215717 |
| 1406 | 0.0191618576645851 |
| 4239 | 0.0168981291353703 |
| 1437 | 0.0160425212234259 |
| 2599 | 0.0156466849148273 |
| 394 | 0.0155945559963584 |
| 3116 | 0.0151005545631051 |
| 4028 | 0.0149106932803988 |
+-------------+--------------------+
10 rows in set (0.03 sec)
The results are fast enough at this scale, but having the ORDER BY SUM(tfidf) means it uses temporary and filesort
explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 65369 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
Possible solution?
I am looking for a solution that does not use temporary or filesort, along the lines of
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index
HAVING total>0.01 ORDER BY cluster_index;
where I do not need to hardcode a threshold for total. Any ideas?
On the topic of caching (subqueries and MySQL cache for an 18M+ row table), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4265544/