We have a table with 9 columns and an indexed primary key (pk). It currently holds 169.3 million records and may grow to 250 million. Every time an update arrives, I have to fetch roughly 40,000 rows from the database, selected by another indexed column named fk, so I can compare them locally. After processing I have:
pk_update_nc = [pk1, pk2, pk5, .....pk40000]
pk_update_PN = [pk3, pk4, pk6, .....pk35090]
new_rows = [[row1], [row2], [row3], [row40000]]
The data above simply means:
update the table, setting the column status (type varying character(3)) to 'NC' where pk is in pk_update_nc
update the table, setting status to 'PN' where pk is in pk_update_PN
insert rows to table from new_rows
What is the best way to do these updates and inserts?
Method 1:
start_transaction:
for pk in pk_update_nc:
update table set status='NC' where table.pk = pk
for pk in pk_update_PN:
update table set status='PN' where table.pk = pk
for row in new_rows:
insert into table row = row
commit
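A minimal runnable sketch of Method 1, using sqlite3 purely as a stand-in for the production PostgreSQL database (table name, pk lists, and sample values are hypothetical; the pattern is what matters). It is correct but issues one statement per row, which is the dominant cost when the lists hold tens of thousands of pks:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, status VARCHAR(3))")
conn.executemany("INSERT INTO t (pk, status) VALUES (?, ?)",
                 [(i, "OLD") for i in range(1, 7)])

pk_update_nc = [1, 2, 5]          # hypothetical results of local processing
pk_update_pn = [3, 4]
new_rows = [(7, "NEW"), (8, "NEW")]

# Method 1: one statement per row, all inside a single transaction.
# Over a real network this means ~80k round trips for 40k-row lists.
with conn:  # BEGIN ... COMMIT
    for pk in pk_update_nc:
        conn.execute("UPDATE t SET status = 'NC' WHERE pk = ?", (pk,))
    for pk in pk_update_pn:
        conn.execute("UPDATE t SET status = 'PN' WHERE pk = ?", (pk,))
    for row in new_rows:
        conn.execute("INSERT INTO t (pk, status) VALUES (?, ?)", row)
```

Wrapping everything in one transaction at least avoids a commit per statement, but the per-row round trips remain.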
Method 2:
start_transaction;
update table set status='NC' where table.pk in pk_update_nc;
update table set status='PN' where table.pk in pk_update_PN;
insert into table values rows
commit
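Method 2 collapses the work into three statements. A hedged sketch, again with sqlite3 as a stand-in and hypothetical names; the `in_clause` helper is mine, not part of the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, status VARCHAR(3))")
conn.executemany("INSERT INTO t (pk, status) VALUES (?, ?)",
                 [(i, "OLD") for i in range(1, 7)])

pk_update_nc = [1, 2, 5]
pk_update_pn = [3, 4]
new_rows = [(7, "NEW"), (8, "NEW")]

def in_clause(pks):
    # Build a parameter placeholder list like "?, ?, ?".
    return ", ".join("?" for _ in pks)

# Method 2: one set-based UPDATE per status value plus one batched INSERT,
# i.e. three statements instead of tens of thousands.
with conn:
    conn.execute("UPDATE t SET status = 'NC' WHERE pk IN (%s)"
                 % in_clause(pk_update_nc), pk_update_nc)
    conn.execute("UPDATE t SET status = 'PN' WHERE pk IN (%s)"
                 % in_clause(pk_update_pn), pk_update_pn)
    conn.executemany("INSERT INTO t (pk, status) VALUES (?, ?)", new_rows)
```

For 40,000 pks, a join against a temporary table (or a `VALUES` list) usually performs better on PostgreSQL than one enormous `IN` list, but the statement count stays the same.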
Method 3:
build one list of full rows: the old records carrying their newly computed status, plus the new records,
start_transaction:
delete from table where fk = fk_provided;
insert all rows, updated + new, using \copy or Django's bulk_create
commit;
- *Explanation of the third method, as requested.* Fetching rows from the database and processing them locally is common to every method. Here, instead of updating the database, we treat the changed old records as new ones: delete all records with the provided fk (an indexed column) from the table, then insert all rows, updated and new, using \copy. \copy inserts records remarkably fast; see the PostgreSQL documentation on COPY TO/COPY FROM.
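The delete-then-reinsert idea of Method 3 can be sketched like this, with sqlite3 standing in for PostgreSQL and all table/column values hypothetical. On the real system the re-insert would go through COPY FROM (psql's \copy) rather than `executemany`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, fk INTEGER, status VARCHAR(3))")
conn.execute("CREATE INDEX t_fk ON t (fk)")  # fk is indexed, as in the question
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 10, "OLD"), (2, 10, "OLD"), (3, 99, "OLD")])

fk_provided = 10
# Old rows rewritten locally with their new status, plus brand-new rows.
all_rows = [(1, 10, "NC"), (2, 10, "PN"), (4, 10, "NEW")]

# Method 3: replace the whole fk slice in one transaction, so readers
# never observe the table with the slice missing.
with conn:
    conn.execute("DELETE FROM t WHERE fk = ?", (fk_provided,))
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", all_rows)
```

Note that deleting and re-inserting 40k rows churns dead tuples in PostgreSQL, so autovacuum activity on the table is worth watching with this approach.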
Method 4? Suggestions welcome.
FAQ:
Why do I need to pull 40,000 rows from the database at all?
Because these records must be processed against the incoming records to determine the status of both old and new rows; the old rows pass through many use cases before their final status is settled. Doing that in the database would mean multiple hits per row, which hurts performance. That's why I decided to pull the data and process it locally before the final update. Now I want that final update to hit the database as lightly as possible.
Concurrency problems:
We resolve these by locking the sheet being processed: the next sheet for the same records cannot be processed until the previous task completes, which prevents users from working on the same fk's sheet simultaneously. The open question on the database side: should I lock the table for the whole processing-plus-update cycle, which could take 1-2 minutes, or only for the update itself, which takes far less time?
Tools:
- psql, PostgreSQL 9.1, Python 2.7, Django 1.5
Accepted answer
You are trying to put lipstick on a pig.
Retrieving 40k rows, manipulating them in some client application, and writing them back is grossly inefficient. On top of that, you can easily run into concurrency problems in a multi-user environment: what if the data changes in the database while you are processing it in your application? How do you resolve such conflicts?
The proper way to do this, if at all possible, is with set-based operations inside the database.
Data-modifying CTEs are particularly useful for more complex operations that touch data in multiple tables. This search here on SO turns up a few examples.
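For illustration, a hedged sketch of what a single set-based statement could look like on PostgreSQL 9.1+, where data-modifying CTEs are available. The names `big_table` and `changes` are hypothetical: `changes` would be a staging table holding the locally computed (pk, fk, status) tuples, loaded in one shot with \copy. Note this pre-9.5 upsert pattern is not race-free under concurrent writers, which the question's sheet-locking scheme would have to cover:

```sql
-- Apply updates and inserts in one statement: update matching rows,
-- then insert only the rows the UPDATE did not touch.
WITH upd AS (
    UPDATE big_table b
    SET    status = c.status
    FROM   changes c
    WHERE  b.pk = c.pk
    RETURNING b.pk
)
INSERT INTO big_table (pk, fk, status)
SELECT c.pk, c.fk, c.status
FROM   changes c
WHERE  c.pk NOT IN (SELECT pk FROM upd);
```

Because everything runs server-side, only the 40k changed tuples cross the wire once, instead of being pulled out and written back row by row.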
Regarding "python - updating records on a table with millions of records", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/20904437/