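mysql - Optimizing SELECT count(DISTINCT ip)

Tags: mysql sql optimization distinct

I am trying to get an aggregate result (the total number of unique IPs) from a table that receives about 2 million new rows per day.

Table: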

CREATE TABLE `clicks` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `hash` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `type` enum('popunder','gallery','exit','direct') COLLATE utf8_unicode_ci NOT NULL,
  `impression_time` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `source_user_id` int(11) NOT NULL,
  `destination_user_id` int(11) NOT NULL,
  `destination_campaign_id` int(11) NOT NULL,
  `destination_campaign_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `destination_campaign_url` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `ip` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `referrer` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `country_code` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `country_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `country` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `isp` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `category_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `category` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `bid` float(8,2) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  KEY `ip` (`ip`),
  KEY `source_user_id` (`source_user_id`),
  KEY `destination_user_id` (`destination_user_id`),
  KEY `destination_campaign_id` (`destination_campaign_id`),
  KEY `clicks_hash_index` (`hash`),
  KEY `clicks_created_at_index` (`created_at`),
  KEY `campaign_date` (`destination_campaign_id`,`created_at`),
  KEY `source_user_date` (`source_user_id`,`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=301539660 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

My query:

SELECT SUM(ips_by_date.count) as count, ips_by_date.date as date
FROM (SELECT count(DISTINCT ip) as count, DATE(created_at) as date 
      FROM clicks as clicks 
      WHERE created_at BETWEEN '2016-05-22 00:00:00' AND '2016-05-23 23:59:59' 
      GROUP BY DATE(created_at)) as ips_by_date 
GROUP BY date;

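Right now, this query takes 93 seconds to run for just one day of data, which makes me think I am missing something.

Is there any optimization I can make to speed up this simple count?

Thanks.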

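Best answer

First, I don't see why the subquery is needed. The inner query already returns one row per date, so there is no need to aggregate again. Second, your query covers two days, not one, but I get the point about performance.

So, let's start with: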

SELECT count(DISTINCT ip) as count, DATE(created_at) as date 
FROM clicks  
WHERE created_at BETWEEN '2016-05-22 00:00:00' AND '2016-05-23 23:59:59' 
GROUP BY DATE(created_at);

For this query, you want an index on clicks(created_at, ip). Also note that I would write it as:

SELECT count(DISTINCT ip) as count, DATE(created_at) as date 
FROM clicks  
WHERE created_at >= '2016-05-22' AND created_at < '2016-05-24' 
GROUP BY DATE(created_at);

This should show some improvement, but I don't think it will get radically better, because the outer aggregation still requires a file sort.
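
As a rough sketch of the index suggestion above, the statements below add the composite index and then inspect the query plan. The index name `clicks_created_at_ip_index` is hypothetical, and only the column order (created_at, ip) comes from the answer. On a table of this size, building the index will probably take a while, so try it on a copy or during a quiet period first.

```sql
-- Hypothetical index name; the column order (created_at, ip) follows the answer.
-- created_at comes first so the range filter can use the index, and ip is
-- included so the query can be answered from the index alone.
ALTER TABLE clicks
  ADD INDEX clicks_created_at_ip_index (created_at, ip);

-- Check which index the optimizer picks and what appears in the Extra column.
-- No expected plan is shown here, since it depends on the MySQL version and
-- the data; a file sort for the GROUP BY may still be reported.
EXPLAIN
SELECT count(DISTINCT ip) as count, DATE(created_at) as date
FROM clicks
WHERE created_at >= '2016-05-22' AND created_at < '2016-05-24'
GROUP BY DATE(created_at);
```

If EXPLAIN reports that the new index is used as a covering index, the query no longer has to read the wide table rows, which is likely where most of the gain would come from.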

Regarding mysql - Optimizing SELECT count(DISTINCT ip), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37381066/
