php - MySQL:在不破坏外键约束的情况下消除重复行

标签 php mysql sql normalization duplicate-removal

我有一个充满标准化地址的客户数据库。有重复。

每个用户都创建了自己的记录,并输入了自己的地址。所以我们在用户和地址之间有一对一的关系:

CREATE TABLE `users` (
  `UserID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Name` VARCHAR(63),
  `Email` VARCHAR(63),
  `AddressID` INT UNSIGNED,
  PRIMARY KEY (`UserID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `addresses` (
  `AddressID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Duplicate` VARCHAR(1),
  `Address1` VARCHAR(63) DEFAULT NULL,
  `Address2` VARCHAR(63) DEFAULT NULL,
  `City` VARCHAR(63) DEFAULT NULL,
  `State` VARCHAR(2) DEFAULT NULL,
  `ZIP` VARCHAR(10) DEFAULT NULL,
  PRIMARY KEY (`AddressID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;  

和数据:

INSERT INTO `users` VALUES
    (1,  'Michael', 'michael@email.com', 1),
    (2,  'Steve',   'steve@email.com',   2),
    (3,  'Judy',    'judy@email.com',    3),
    (4,  'Kathy',   'kathy@email.com',   4),
    (5,  'Mark',    'mark@email.com',    5),
    (6,  'Robert',  'robert@email.com',  6),
    (7,  'Susan',   'susan@email.com',   7),
    (8,  'Paul',    'paul@email.com',    8),
    (9,  'Patrick', 'patrick@email.com', 9),
    (10, 'Mary',    'mary@email.com',    10),
    (11, 'James',   'james@email.com',   11),
    (12, 'Barbara', 'barbara@email.com', 12),
    (13, 'Peter',   'peter@email.com',   13);


INSERT INTO `addresses` VALUES
    (1,  '',  '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (2,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (3,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (4,  '',  '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (5,  '',  '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (6,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (7,  'Y', '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (8,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (9,  '',  '1000 State Street',  'Apt A', 'Sunnydale',   'OH', '54321'),
    (10, 'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (11, 'Y', '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (12, 'Y', '1000 Main Street',   'Apt A', 'Sunnydale',   'OH', '54321'),
    (13, '',  '9999 Valleyview',    '',      'Springfield', 'KS', '54321');

哦,是的,让我添加外键关系:

ALTER TABLE `users` ADD CONSTRAINT `AddressID` 
FOREIGN KEY `AddressID` (`AddressID`)
REFERENCES `addresses` (`AddressID`);

我们让第 3 方服务清理了我们的地址列表,该服务规范化了数据并指出了重复项。这就是 Duplicate 列的来源。如果有“Y”,则它是另一个地址的副本。如示例数据所示,主地址未标记为重复。

我显然想删除所有重复记录,但有指向它们的用户记录。我需要它们指向不重复的地址版本。

那么如何更新 users 中的 AddressID 以匹配非重复地址?

我能想到的唯一方法是使用高级语言遍历所有数据,但我相当确定 MySQL 拥有以更好的方式执行此类操作所需的所有工具。

这是我尝试过的:

SELECT COUNT(*) as cnt, GROUP_CONCAT(AddressID ORDER BY AddressID) AS ids
FROM addresses
GROUP BY Address1, Address2, City, State, ZIP
HAVING cnt > 1;

+-----+--------------+
| cnt | ids          |
+-----+--------------+
|   2 | 5,7          |
|   6 | 1,2,3,6,8,10 |
|   2 | 4,11         |
+-----+--------------+
3 rows in set (0.00 sec)

从那里,我可以遍历每个结果行并执行此操作:

UPDATE `users` SET `AddressID` = 1 WHERE `AddressID` IN (2,3,6,8,10);

但必须有更好的 MySQL-only 方式,不是吗?

说完一切后,数据应该如下所示:

SELECT * FROM `users`;
+--------+---------+-------------------+-----------+
| UserID | Name    | Email             | AddressID |
+--------+---------+-------------------+-----------+
|      1 | Michael | michael@email.com |         1 |
|      2 | Steve   | steve@email.com   |         1 |
|      3 | Judy    | judy@email.com    |         1 |
|      4 | Kathy   | kathy@email.com   |         4 |
|      5 | Mark    | mark@email.com    |         5 |
|      6 | Robert  | robert@email.com  |         1 |
|      7 | Susan   | susan@email.com   |         5 |
|      8 | Paul    | paul@email.com    |         1 |
|      9 | Patrick | patrick@email.com |         9 |
|     10 | Mary    | mary@email.com    |         1 |
|     11 | James   | james@email.com   |         4 |
|     12 | Barbara | barbara@email.com |         1 |
|     13 | Peter   | peter@email.com   |        13 |
+--------+---------+-------------------+-----------+
13 rows in set (0.00 sec)

SELECT * FROM `addresses`;
+-----------+-----------+--------------------+----------+-------------+-------+-------+
| AddressID | Duplicate | Address1           | Address2 | City        | State | ZIP   |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
|         1 |           | 1234 Main Street   |          | Springfield | KS    | 54321 |
|         4 |           | 5678 Sycamore Lane |          | Upstate     | NY    | 50000 |
|         5 |           | 1000 State Street  | Apt C    | Sunnydale   | OH    | 54321 |
|         9 |           | 1000 State Street  | Apt A    | Sunnydale   | OH    | 54321 |
|        13 |           | 9999 Valleyview    |          | Springfield | KS    | 54321 |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
5 rows in set (0.00 sec)

帮忙吗?

最佳答案

用户和地址之间存在多对一关系(即多个用户可以映射到同一地址)。这对我来说似乎有点奇怪,但我想它可能会有用。多对多会更有意义,即一个用户可以有多个地址,但同一个地址可以由多个用户共享。通常,单个用户有多个地址。更新您的架构可能会有所帮助,但我离题了。

UPDATE users
-- We only care about users mapped to duplicate addresses
JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate='Y')
-- If your normalizer thingy worked right, these will all map to non-duplicates
JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1
    -- Compare to other columns if you want
    AND nondupe.Duplicate = '')
-- Set to the nondupe ID
SET users.AddressID = nondupe.AddressID;

http://sqlfiddle.com/#!2/5d303/1

关于php - MySQL:在不破坏外键约束的情况下消除重复行,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/20232799/

相关文章:

php - CodeIgniter SQL 查询缺少列

php - jQuery 无法从 href 属性获取文件部分

mysql - 获取每组中元素之间的最小差异

php - 如何获取目录下的所有csv文件名?

javascript - 启用和禁用所需的 Angular JS

MySQL 查询 - 分组值和聚合

php - 传递此查询以从表中获取登录名时出现错误

php - 我的第一个 CakePHP 项目中未找到列错误

php - 在 MYSQL 中使用单个查询获取 3 个不同 SUM 的 "best"方法是什么

sql - 更新 SQL Server 2008 登录 SID