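I have a large CSV file (>1 GB) on network file storage that is updated weekly with new records. The file contains columns similar to these:
Customer ID | Product | Online? (Bool) | Amount | Date
I need to use this file to update a PostgreSQL database, keyed by customer ID, that holds the total amount for each month broken down by product and store. Like this:
Customer ID | Month | (several unrelated fields) | Product 1 (Online) | Product 1 (Offline) | Product 2 (Online) | etc...
Because the file is so large (and grows steadily with every update), I need an efficient way to pick out the updated records and update the database. Unfortunately, our server updates the file by customer ID rather than by date, so I can't track changes by date.
Is there a clever way to compare the files that won't break as the file keeps growing?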
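COPY the file into a staging table. This of course assumes you have a PK, i.e. a unique identifier for each row that never changes. I compute a checksum over the remaining columns, do the same for the rows you have already loaded into the destination table, and compare source against destination. This finds updated, deleted, and new rows.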
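For the CSV described in the question, the load step might look roughly like the sketch below. The schema name `staging`, the table name `staging.sales`, the column names and types, and the file path are all assumptions made for illustration; the question does not give them. Note also that this file has no obvious single-column PK, so the comparison key would have to be chosen from the columns that really are unique and stable.

```sql
-- Hypothetical staging table mirroring the CSV columns from the question.
-- Names and types are assumptions, not taken from the original post.
CREATE SCHEMA IF NOT EXISTS staging;

CREATE TABLE IF NOT EXISTS staging.sales (
    customer_id integer,
    product     text,
    online      boolean,
    amount      numeric,
    sale_date   date
);

-- Start from an empty staging table on every weekly load.
TRUNCATE staging.sales;

-- COPY reads the file from the database server's file system.
-- The path is a placeholder; adjust it to wherever the export is mounted.
COPY staging.sales (customer_id, product, online, amount, sale_date)
FROM '/path/to/weekly_export.csv'
WITH (FORMAT csv, HEADER true);
```

If the file is only reachable from the client machine, psql's `\copy` takes the same options and streams the file over the connection instead.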
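As you can see in the demo script below, I haven't added any indexes or tuned it in any other way. My goal was just to get it working correctly.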
-- Set up two schemas: "source" stands in for the staging table, "destination" for the target table.
CREATE SCHEMA source;
CREATE SCHEMA destination;
--DROP TABLE source.employee;
--DROP TABLE destination.employee;

-- Fill both tables with the same 10 million rows.
SELECT x AS employee_id, CAST('Bob' AS text) AS first_name, CAST('H' AS text) AS last_name, CAST(21 AS integer) AS age
INTO source.employee
FROM generate_series(1,10000000) x;

SELECT x AS employee_id, CAST('Bob' AS text) AS first_name, CAST('H' AS text) AS last_name, CAST(21 AS integer) AS age
INTO destination.employee
FROM generate_series(1,10000000) x;

-- Compare source against destination on the PK and flag every difference.
-- With identical tables this returns no rows.
-- Note: || yields NULL if any operand is NULL, so wrap nullable columns in coalesce().
SELECT
    d.*,
    s.*,
    CASE
        WHEN md5(s.first_name || s.last_name || s.age) != md5(d.first_name || d.last_name || d.age) THEN 'CHECKSUM'  -- on both sides, but changed
        WHEN d.employee_id IS NULL THEN 'Missing'  -- in source only (new row)
        WHEN s.employee_id IS NULL THEN 'Orphan'   -- in destination only (deleted from source)
    END AS AuditFailureType
FROM destination.employee d
FULL OUTER JOIN source.employee s
    ON d.employee_id = s.employee_id
WHERE (d.employee_id IS NULL OR s.employee_id IS NULL)
   OR md5(s.first_name || s.last_name || s.age) != md5(d.first_name || d.last_name || d.age);

-- Mimic the source data getting an update.
UPDATE source.employee
SET age = 99
WHERE employee_id = 45000;

-- Re-run the same comparison; employee_id 45000 is now reported as 'CHECKSUM'.
SELECT
    d.*,
    s.*,
    CASE
        WHEN md5(s.first_name || s.last_name || s.age) != md5(d.first_name || d.last_name || d.age) THEN 'CHECKSUM'
        WHEN d.employee_id IS NULL THEN 'Missing'
        WHEN s.employee_id IS NULL THEN 'Orphan'
    END AS AuditFailureType
FROM destination.employee d
FULL OUTER JOIN source.employee s
    ON d.employee_id = s.employee_id
WHERE (d.employee_id IS NULL OR s.employee_id IS NULL)
   OR md5(s.first_name || s.last_name || s.age) != md5(d.first_name || d.last_name || d.age);
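
The comparison only reports differences; it does not apply them. One possible way to bring the destination in line with the source, using the same demo tables, is sketched below. It is not part of the original answer, and it assumes `employee_id` is the unchanging PK and that deletions in the source should also be deleted in the destination.

```sql
-- Sketch: apply the three kinds of differences to destination.employee.
-- Assumes employee_id uniquely identifies a row on both sides.
BEGIN;

-- 'CHECKSUM' rows: overwrite changed rows with the source values.
UPDATE destination.employee d
SET first_name = s.first_name,
    last_name  = s.last_name,
    age        = s.age
FROM source.employee s
WHERE d.employee_id = s.employee_id
  AND md5(s.first_name || s.last_name || s.age)
      != md5(d.first_name || d.last_name || d.age);

-- 'Missing' rows: insert rows that exist only in the source.
INSERT INTO destination.employee (employee_id, first_name, last_name, age)
SELECT s.employee_id, s.first_name, s.last_name, s.age
FROM source.employee s
WHERE NOT EXISTS (
    SELECT 1
    FROM destination.employee d
    WHERE d.employee_id = s.employee_id
);

-- 'Orphan' rows: delete rows that no longer exist in the source.
DELETE FROM destination.employee d
WHERE NOT EXISTS (
    SELECT 1
    FROM source.employee s
    WHERE s.employee_id = d.employee_id
);

COMMIT;
```

Running the three statements in one transaction means other readers of the destination table see either the old state or the fully synchronized one, never a half-applied mix.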