python - Updating records in a table with millions of records

Tags: python sql django postgresql

We have a table with 9 columns and an indexed primary key pk. It currently holds 169.3 million records and may grow to 250 million. Every time I receive an update, I have to fetch roughly 40,000 rows from the database, selected by another indexed column named fk, in order to compare them. After processing I have:

pk_update_nc = [pk1, pk2, pk5, .....pk40000]
pk_update_PN = [pk3, pk4, pk6, .....pk35090]
new_rows = [[row1], [row2], [row3], .....[row40000]]

The data above simply means:

update the table, setting the status column (type character varying(3)) to 'NC' where pk is in pk_update_nc; update the table, setting status to 'PN' where pk is in pk_update_PN;

insert the rows from new_rows into the table

What is the best way to do these updates and inserts?

Method 1:

start_transaction:
for pk in pk_update_nc:
    update table set status='NC' where table.pk = pk
for pk in pk_update_PN:
    update table set status='PN' where table.pk = pk
for row in new_rows:
    insert into table row = row
commit
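
As a rough illustration of Method 1, here is a minimal psycopg2 sketch; the connection parameters, the table name mytable, and the three-column insert list are placeholders, not the actual nine-column schema:

import psycopg2

# Placeholder connection; adjust dbname/user/host to your environment.
conn = psycopg2.connect(dbname="mydb", user="myuser")

with conn:                        # one transaction for the whole batch
    with conn.cursor() as cur:
        # One round trip per primary key: simple, but ~40,000 statements.
        for pk in pk_update_nc:
            cur.execute("UPDATE mytable SET status = 'NC' WHERE pk = %s", (pk,))
        for pk in pk_update_PN:
            cur.execute("UPDATE mytable SET status = 'PN' WHERE pk = %s", (pk,))
        for row in new_rows:
            # Only 3 of the 9 columns shown.
            cur.execute(
                "INSERT INTO mytable (pk, fk, status) VALUES (%s, %s, %s)",
                row,
            )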

Method 2:

start_transaction;
update table set status='NC' where table.pk in pk_update_nc;
update table set status='PN' where table.pk in pk_update_PN;
insert into table values rows
commit
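
A similar sketch for Method 2, where each UPDATE is a single set-based statement and psycopg2 adapts the Python lists to PostgreSQL arrays; table and column names are again placeholders, and execute_values requires psycopg2 >= 2.7:

import psycopg2
from psycopg2.extras import execute_values   # psycopg2 >= 2.7

conn = psycopg2.connect(dbname="mydb", user="myuser")   # placeholder connection

with conn:
    with conn.cursor() as cur:
        # One statement per status value; pk = ANY(array) avoids building
        # a 40,000-element IN (...) list by hand.
        cur.execute("UPDATE mytable SET status = 'NC' WHERE pk = ANY(%s)",
                    (pk_update_nc,))
        cur.execute("UPDATE mytable SET status = 'PN' WHERE pk = ANY(%s)",
                    (pk_update_PN,))
        # Multi-row INSERT in a single statement (only 3 of the 9 columns shown).
        execute_values(cur,
                       "INSERT INTO mytable (pk, fk, status) VALUES %s",
                       new_rows)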

Method 3:

build the list of updated records as complete rows rather than just their keys,
so that all records can be re-inserted into the table
start_transaction:
delete from table where fk = fk_provided;
insert all rows, updated + new using \copy or django bulk create
commit;
  • Explanation of the third method, as requested: every method fetches the rows from the database and processes them locally; the difference here is that instead of updating the database, we treat the changed old records as new ones. Delete all records from the table that have the given fk (an indexed column), then insert all records, updated plus new, using \copy. \copy inserts records remarkably fast; for \copy see the PostgreSQL documentation on COPY TO / COPY FROM.
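
For Method 3, a sketch using psycopg2's copy_from instead of the psql \copy command; all names are placeholders, all_rows is assumed to hold the reprocessed old rows plus new_rows as plain tuples, and the snippet is written in Python 3 style rather than the Python 2.7 mentioned below:

import io
import psycopg2

conn = psycopg2.connect(dbname="mydb", user="myuser")   # placeholder connection

# Serialize every row (reprocessed old rows + new_rows) as tab-separated text
# so it can be streamed into the table with COPY.
buf = io.StringIO()
for row in all_rows:
    buf.write("\t".join(str(v) for v in row) + "\n")
buf.seek(0)

with conn:
    with conn.cursor() as cur:
        # Drop the old records for this fk, then bulk-load the fresh set.
        cur.execute("DELETE FROM mytable WHERE fk = %s", (fk_provided,))
        cur.copy_from(buf, "mytable", columns=("pk", "fk", "status"))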

Method 4? Suggestions welcome.

FAQ:

Why should I pull 40,000 rows from db?

Because we have to process these records against the new records to determine the status of both old and new rows; an old row passes through many use cases before its status is final. Doing that against the database would mean multiple round trips per row and a real performance cost, which is why I decided to pull the data and process it locally before the final update. Now I want to update the database with the smallest possible performance hit.

concurrency problems:

We resolve this by locking the sheet being processed: the next sheet for the same records cannot be processed until the previous task has completed, which prevents users from working on sheets for the same fk simultaneously. The remaining question concerns the database itself: should I lock it for the whole update-and-processing run, which can take up to 1-2 minutes, or only for the update, which takes much less time? (A per-fk locking sketch follows below.)
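
If the only goal is to ensure that two workers never process the same fk at the same time, one possible approach (not what is described above, just a hedged sketch) is a PostgreSQL advisory lock keyed on fk, which blocks only other workers for that fk and is released at commit:

import psycopg2

conn = psycopg2.connect(dbname="mydb", user="myuser")   # placeholder connection

with conn:
    with conn.cursor() as cur:
        # pg_advisory_xact_lock waits until no other transaction holds the same
        # key (assuming fk_provided is an integer) and releases it at commit.
        cur.execute("SELECT pg_advisory_xact_lock(%s)", (fk_provided,))
        # ... run the updates / delete + COPY for this fk inside the same
        # transaction here ...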

tools:

  • PostgreSQL 9.1 (psql), Python 2.7, Django 1.5

Best Answer

You are trying to put lipstick on a pig.

Retrieving 40k rows, manipulating them in some client application, and writing them back is extremely inefficient. On top of that, you can easily run into concurrency problems in a multi-user environment. What if the database changes while you are processing the data in your application? How do you resolve such conflicts?

The proper way to do this, if at all possible, is to perform the operation inside the database with set-based operations.

Data-modifying CTEs are particularly useful for more complex operations that touch data in multiple tables. This search here on SO turns up a few examples.
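
To make that concrete, here is a hedged sketch of a data-modifying CTE that performs both status updates in a single statement, executed from Python; the table name and the exact query are assumptions, not the answerer's code, and the new rows could then be inserted in the same statement or bulk-loaded separately:

import psycopg2

conn = psycopg2.connect(dbname="mydb", user="myuser")   # placeholder connection

with conn:
    with conn.cursor() as cur:
        # Both updates run inside one statement; RETURNING lets the final
        # SELECT report how many rows each branch touched.
        cur.execute(
            """
            WITH nc AS (
                UPDATE mytable SET status = 'NC'
                WHERE  pk = ANY(%(nc_pks)s)
                RETURNING pk
            ),
            pn AS (
                UPDATE mytable SET status = 'PN'
                WHERE  pk = ANY(%(pn_pks)s)
                RETURNING pk
            )
            SELECT (SELECT count(*) FROM nc) AS updated_nc,
                   (SELECT count(*) FROM pn) AS updated_pn
            """,
            {"nc_pks": pk_update_nc, "pn_pks": pk_update_PN},
        )
        updated_nc, updated_pn = cur.fetchone()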

Regarding "python - Updating records in a table with millions of records", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20904437/
