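mysql - Get old user balances from a system with ~10 million users and transactions, after filtering out inactive users (i.e. users without any transaction)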

Tags: mysql optimization data-retrieval

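This looks more like a db.stackexchange question, but I am posting it here because there may also be a scripting solution. Please excuse the lack of structure in the question.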

Tables involved -

Account

CREATE TABLE `account` (
  `id` bigint(15) NOT NULL AUTO_INCREMENT,
  `account_id` bigint(14) NOT NULL,
  `acc_complete_id` bigint(14) DEFAULT NULL,
  `uuid` varchar(400) NOT NULL,
  `type` int(11) DEFAULT NULL,
  `created` datetime DEFAULT NULL,
  `balance` decimal(19,2) DEFAULT '0.00',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uuid_UNIQUE` (`uuid`),
  UNIQUE KEY `account_id_UNIQUE` (`account_id`),
  UNIQUE KEY `acc_complete_id_UNIQUE` (`acc_complete_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Transaction

CREATE TABLE `transaction` (
  `id` bigint(19) NOT NULL AUTO_INCREMENT,
  `type` int(4) DEFAULT NULL,
  `created` datetime DEFAULT NULL,
  `amount` decimal(19,2) DEFAULT '0.00',
  `debit` bigint(14) DEFAULT NULL,
  `credit` bigint(14) DEFAULT NULL,
  `status` varchar(45) DEFAULT NULL,
  `debit_bal` decimal(19,2) DEFAULT '0.00',
  `credit_bal` decimal(19,2) DEFAULT '0.00',
  PRIMARY KEY (`id`),
  KEY `transaction_credit_index` (`credit`),
  KEY `transaction_debit_index` (`debit`),
  KEY `transaction_created_index` (`created`),
  -- the `ref` and `narrative` columns indexed below are missing from the posted definition
  KEY `transaction_ref_index` (`ref`),
  KEY `transaction_narrative_index` (`narrative`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

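Table columns

  • debit_bal and credit_bal are the balances, after the transaction, of the two accounts involved in it.

We currently find the total number of inactive users with zero balance (a user counts as inactive if they have zero balance and have not made a transaction in a certain period). Now comes the painful part: we need the same data for past months (inactivity during that period, and the account balance at that time).

The query currently used to get the count of inactive users with zero balance, with some conditions on creation date, type, etc. -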

SELECT
  count(DISTINCT( a.uuid )),
  Sum(a.balance) 
FROM
  account a 
WHERE
  a.balance = 0.00 and a.type = "1" 
  AND a.created <= '2018-02-28 18:29:59' 
  AND 
  (
    a.account_id + 100000000000 
  )
  NOT IN 
  (
    SELECT DISTINCT
( pt.debit ) 
    FROM
      transaction pt 
    WHERE
      pt.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59' 
      AND MOD(pt.debit, 100000000000) IN 
      (
        SELECT
          pa.account_id 
        FROM
          account pa 
        WHERE
          pa.type = "1" 
          AND pa.created <= '2018-02-28 18:29:59' 
      )
    UNION
    SELECT DISTINCT
( pt.credit ) 
    FROM
      transaction pt 
    WHERE
      pt.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59' 
      AND MOD(pt.credit, 100000000000) IN 
      (
        SELECT
          pa.account_id 
        FROM
          account pa 
        WHERE
          pa.type = "1" 
          AND pa.created <= '2018-02-28 18:29:59' 
      )
  )

The above query returns the count of inactive users, along with their balance, in about 10 minutes.

Inactive users = users - (set of users who were debited UNION users who were credited).
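One way to read the set difference above is as an anti-join against a list of active ids that is built once. The following is only a sketch under assumptions: the helper table `active_ids` and its column `full_id` are hypothetical names, and it assumes, as the query above does, that a type-1 account shows up in `transaction.debit` / `transaction.credit` as `account_id + 100000000000`. The `balance = 0.00` filter is left out here and can be added to the final `WHERE` if needed.

```sql
-- Hypothetical helper table: every id that was debited or credited in the window.
CREATE TEMPORARY TABLE active_ids (
  full_id BIGINT NOT NULL,
  PRIMARY KEY (full_id)
) ENGINE=InnoDB;

-- INSERT IGNORE plus the primary key de-duplicates the ids; NULLs are skipped.
INSERT IGNORE INTO active_ids (full_id)
SELECT debit
FROM `transaction`
WHERE created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59'
  AND debit IS NOT NULL;

INSERT IGNORE INTO active_ids (full_id)
SELECT credit
FROM `transaction`
WHERE created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59'
  AND credit IS NOT NULL;

-- Anti-join: keep only the accounts that have no row in active_ids.
SELECT
  COUNT(*)       AS inactive_users,
  SUM(a.balance) AS total_balance
FROM account a
LEFT JOIN active_ids act
  ON act.full_id = a.account_id + 100000000000
WHERE a.type = 1
  AND a.created <= '2018-02-28 18:29:59'
  AND act.full_id IS NULL;
```

The id list is scanned from `transaction` only twice and then probed by primary key, so the per-row dependent subqueries are avoided. Whether this is actually faster on this data set would have to be measured.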

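However, I cannot run this query for earlier months, because the values I get would be based on the current balance, and the balance at that time was not the same. The account type may not have been the same either, but we found those types and updated them in the duplicated tables.

Now, when I try to get the inactive users along with their current balance by removing count() and adding a GROUP BY on uuid at the end, the query runs for more than 15 hours, and the MySQL thread state shows "removing duplicates" most of the time.

EXPLAIN output -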

+----+--------------------+------------+------------+-------------+------------------------------------------------------------------------------------------+----------------------------------+---------+------+----------+----------+----------------------------------------------+
| id | select_type        | table      | partitions | type        | possible_keys                                                                            | key                              | key_len | ref  | rows     | filtered | Extra                                        |
+----+--------------------+------------+------------+-------------+------------------------------------------------------------------------------------------+----------------------------------+---------+------+----------+----------+----------------------------------------------+
|  1 | PRIMARY            | a          | NULL       | ALL         | PRIMARY,uuid_UNIQUE,account_id_UNIQUE,acc_complete_id_UNIQUE,created_index,updated_index | NULL                             | NULL    | NULL | 23745634 |     5.00 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | pt         | NULL       | ref_or_null | transaction_debit_index,transaction_created_index                        | transaction_debit_index  | 9       | func |       32 |     7.52 | Using where                                  |
|  2 | DEPENDENT SUBQUERY | pa         | NULL       | eq_ref      | account_id_UNIQUE,created_index                                                          | account_id_UNIQUE                | 8       | func |        1 |     5.00 | Using index condition; Using where           |
|  4 | DEPENDENT UNION    | pt         | NULL       | ref_or_null | transaction_credit_index,transaction_created_index                       | transaction_credit_index | 9       | func |       22 |     7.52 | Using where                                  |
|  4 | DEPENDENT UNION    | pa         | NULL       | eq_ref      | account_id_UNIQUE,created_index                                                          | account_id_UNIQUE                | 8       | func |        1 |     5.00 | Using index condition; Using where           |
| NULL | UNION RESULT       | <union2,4> | NULL       | ALL         | NULL                                                                                     | NULL                             | NULL    | NULL |     NULL |     NULL | Using temporary                              |
+----+--------------------+------------+------------+-------------+------------------------------------------------------------------------------------------+----------------------------------+---------+------+----------+----------+----------------------------------------------+
6 rows in set, 1 warning (0.00 sec)

Now I need to get the list of users, and this takes a very long time -

       SELECT DISTINCT
( a.uuid ),
  Sum(a.balance) 
FROM
  account a 
WHERE
  a.type = "1" 
  AND a.created <= '2018-02-28 18:29:59' 
  AND 
  (
    a.account_id + 100000000000 
  )
  NOT IN 
  (
    SELECT DISTINCT
( pt.debit ) 
    FROM
      transaction pt 
    WHERE
      pt.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59' 
      AND MOD(pt.debit, 100000000000) IN 
      (
        SELECT
          pa.account_id 
        FROM
          account pa 
        WHERE
          pa.type = "1" 
          AND pa.created <= '2018-02-28 18:29:59'
      )
    UNION
    SELECT DISTINCT
( pt.credit ) 
    FROM
      transaction pt 
    WHERE
      pt.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59' 
      AND MOD(pt.credit, 100000000000) IN 
      (
        SELECT
          pa.account_id 
        FROM
          account pa 
        WHERE
          pa.type = "1" 
          AND pa.created <= '2018-02-28 18:29:59'
      )
  )
GROUP BY
  a.id;

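This has been running for nearly 15 hours and is still going. That is too long, because I need to do this for several months, and any mistake means running it all again.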
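Since `account.id` is the primary key and the query has no join that multiplies rows, each account can appear at most once in the result, so `DISTINCT`, `SUM` and `GROUP BY` should not be needed for the per-user list. A possible rewrite is sketched below. It is not a tested solution: the composite index names are hypothetical, and it assumes the inner `MOD(...) IN (...)` filter can be dropped, because a transaction id equal to `a.account_id + 100000000000` already belongs to the account row being tested.

```sql
-- Assumed extra indexes (hypothetical names); the tables are isolated,
-- so adding them should not affect live traffic.
ALTER TABLE `transaction`
  ADD INDEX idx_txn_debit_created (debit, created),
  ADD INDEX idx_txn_credit_created (credit, created);

-- One row per account: no DISTINCT, SUM or GROUP BY required.
SELECT
  a.uuid,
  a.balance
FROM account a
WHERE a.type = 1
  AND a.created <= '2018-02-28 18:29:59'
  -- no debit for this account in the window
  AND NOT EXISTS (
    SELECT 1
    FROM `transaction` t
    WHERE t.debit = a.account_id + 100000000000
      AND t.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59'
  )
  -- no credit for this account in the window
  AND NOT EXISTS (
    SELECT 1
    FROM `transaction` t
    WHERE t.credit = a.account_id + 100000000000
      AND t.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59'
  );
```

Each `NOT EXISTS` can stop at the first matching index entry, which is usually cheaper than building and de-duplicating the full `UNION` for every outer row.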

Some sample data -

Account table -

+------+------------+-----------------+---------------------+---------------------+---------------------+---------+
| id   | account_id | acc_complete_id | uuid                | last_updated        | created             | balance |
+------+------------+-----------------+---------------------+---------------------+---------------------+---------+
|   29 |      50536 |    100000050536 | 1026651502611722400 | 2020-01-09 12:43:49 | 2018-01-01 00:00:01 | 2092.10 |
| 1337 |      53071 |    100000053071 | 7266704751953077361 | 2019-12-26 11:45:54 | 2019-10-22 18:13:21 |   99.00 |
|   30 |      50673 |    100000050673 | 8799857402485889540 | 2020-01-05 13:21:16 | 2017-01-01 00:00:01 | 2166.10 |
+------+------------+-----------------+---------------------+---------------------+---------------------+---------+

Transaction table -

+---------+---------------------+--------+--------------+--------------+-----------+------------+
| id      | created             | amount | debit        | credit       | debit_bal | credit_bal |
+---------+---------------------+--------+--------------+--------------+-----------+------------+
| 2001705 | 2019-12-07 14:14:18 |   1.00 | 100000050536 |            3 |   2092.00 | 2332445.91 |
| 2001869 | 2020-05-08 14:29:00 |   4.00 | 100000050673 | 200000052870 |   2088.10 |       4.00 |
| 2001874 | 2020-05-09 14:45:04 |   4.00 | 100000050673 | 200000052870 |   2084.10 |       8.00 |
| 2001875 | 2020-05-09 14:46:37 |   4.00 | 100000050673 | 200000052870 |   2080.10 |      12.00 |
| 2002018 | 2019-11-29 18:05:41 |  50.00 | 100000053071 | 300000050673 |      0.00 |    2170.10 |
| 2002019 | 2019-11-29 18:07:41 |   1.00 | 100000053071 | 300000050673 |    100.00 |    2170.10 |
| 2002020 | 2019-11-29 18:07:56 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002021 | 2019-11-29 18:15:22 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002022 | 2019-11-29 18:18:45 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002023 | 2019-11-29 18:20:41 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002024 | 2019-11-29 18:24:18 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002025 | 2019-11-29 18:26:19 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002026 | 2019-11-29 18:28:41 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002027 | 2019-11-29 18:29:37 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002028 | 2019-11-29 18:30:40 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002029 | 2019-11-29 18:35:55 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002030 | 2019-11-29 18:42:16 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002031 | 2019-12-02 13:12:01 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002032 | 2019-12-02 13:18:21 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002033 | 2019-12-02 13:27:53 |   1.00 | 100000053071 |            5 |    100.00 |  580037.00 |
| 2002034 | 2019-12-02 13:38:11 |   1.00 | 100000053071 |            5 |     99.00 |  580038.00 |
+---------+---------------------+--------+--------------+--------------+-----------+------------+

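Summary

  • So, I have to get the list of users and their current balances from here. This is the bottleneck, and I cannot think of a way to break this part down to derive the final result.

  • Once I have the list of such users with their current balances, I can query another table that holds each user's total debit and credit transactions for the following months, then do some addition and subtraction to derive each user's old balance, and sum those to get the total for all such users. The number of these users with txns in a given month is barely in double digits, so this part is quick.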
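The addition and subtraction described above amounts to: old balance = current balance - credits received after the cutoff + debits made after the cutoff. A minimal sketch of that arithmetic follows. It assumes the cutoff `2019-11-30 18:29:59`, the `status = 'SUCCESSFUL'` filter used in the answer below, and that an account is matched as `account_id + 100000000000` (the answer below matches by `MOD` instead). The aliases `c`, `d`, `credited` and `debited` are hypothetical.

```sql
SELECT
  a.account_id,
  a.balance AS current_balance,
  -- undo everything that happened after the cutoff
  a.balance
    - COALESCE(c.credited, 0)
    + COALESCE(d.debited, 0) AS old_balance
FROM account a
-- total credited to each id after the cutoff (one row per id)
LEFT JOIN (
  SELECT credit AS full_id, SUM(amount) AS credited
  FROM `transaction`
  WHERE created > '2019-11-30 18:29:59'
    AND status = 'SUCCESSFUL'
  GROUP BY credit
) c ON c.full_id = a.account_id + 100000000000
-- total debited from each id after the cutoff (one row per id)
LEFT JOIN (
  SELECT debit AS full_id, SUM(amount) AS debited
  FROM `transaction`
  WHERE created > '2019-11-30 18:29:59'
    AND status = 'SUCCESSFUL'
  GROUP BY debit
) d ON d.full_id = a.account_id + 100000000000
WHERE a.type = 1
  AND a.created <= '2018-02-28 18:29:59';
```

Each derived table has at most one row per id, so the two left joins do not duplicate account rows. Accounts with no later transactions keep their current balance through `COALESCE`.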

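I am now considering alternative ways to get the data. Note that we have isolated these tables and there is no live traffic on them, so we can add more indexes as needed.

I don't have much time to try many approaches, but what I thought of next is adding flag fields to the account table, such as "nov_inactive", "dec_inactive", etc., indicating that the user was inactive in that month. I suspect that updating the duplicated table with the same selection criteria would take a similar amount of time -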

update
  account_copy a
set
  nov_updates = 
  (
    1
  )
WHERE
  a.type = "1" 
  AND a.created <= '2018-02-28 18:29:59' 
  AND 
  (
    a.account_id + 100000000000 
  )
  NOT IN 
  (
    SELECT DISTINCT
( pt.debit ) 
    FROM
      transaction pt 
    WHERE
      pt.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59' 
      AND MOD(pt.debit, 100000000000) IN 
      (
        SELECT
          pa.account_id 
        FROM
          account pa 
        WHERE
          pa.type = "1" 
          AND pa.created <= '2018-02-28 18:29:59'
      )
    UNION
    SELECT DISTINCT
( pt.credit ) 
    FROM
      transaction pt 
    WHERE
      pt.created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59' 
      AND MOD(pt.credit, 100000000000) IN 
      (
        SELECT
          pa.account_id 
        FROM
          account pa 
        WHERE
          pa.type = "1" 
          AND pa.created <= '2018-02-28 18:29:59'
      )
  )

Any ideas?

Best answer

This is how we calculated the old balances -

SELECT
  COUNT(account_id),
  SUM(IF(temp2.old_balance IS NULL, temp1.balance, temp2.old_balance))
FROM
  (
    -- temp1: type-1 accounts with no debit or credit in the reporting window,
    -- together with their current balance
    SELECT
      pa.account_id,
      pa.balance,
      temp.acc_id AS acc_id
    FROM
      account AS pa FORCE INDEX (created_index)
      LEFT JOIN
      (
        (SELECT MOD(debit, 100000000000) AS acc_id
         FROM transaction
         WHERE created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59')
        UNION
        (SELECT MOD(credit, 100000000000) AS acc_id
         FROM transaction
         WHERE created BETWEEN '2018-02-28 18:29:59' AND '2019-11-30 18:29:59')
      ) AS temp
      ON pa.account_id = temp.acc_id
    WHERE
      pa.type = '1'
      AND pa.created <= '2018-02-28 18:29:59'
    HAVING
      acc_id IS NULL
  ) AS temp1
  LEFT JOIN
  (
    -- temp2: old balance = current balance - net amount transacted after the window
    SELECT
      temp.acc_id,
      temp.txn_amt,
      b.balance,
      (b.balance - temp.txn_amt) AS old_balance
    FROM
      (
        -- credits count as positive, debits as negative
        SELECT
          MOD(temp.acc_id, 100000000000) AS acc_id,
          SUM(IF(type = 1, temp.amount, 0 - temp.amount)) AS txn_amt
        FROM
          (
            SELECT credit AS acc_id, SUM(amount) AS amount, '1' AS type
            FROM transaction
            WHERE created > '2019-11-30 18:29:59' AND status = "SUCCESSFUL"
            GROUP BY credit
            UNION
            SELECT debit AS acc_id, SUM(amount) AS amount, '0' AS type
            FROM transaction
            WHERE created > '2019-11-30 18:29:59' AND status = "SUCCESSFUL"
            GROUP BY debit
          ) AS temp
        GROUP BY temp.acc_id
      ) AS temp
      JOIN account AS b ON temp.acc_id = b.account_id
    WHERE
      b.created <= '2018-02-28 18:29:59'
      AND type = '1'
  ) AS temp2
  ON temp1.account_id = temp2.acc_id

In the temp1 alias we get the current balance; in the temp2 alias we get the old balance of users who transacted after the reporting month.

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/59697168/
