python - Alternative to "write to file" for transferring CSV data to PostgreSQL with COPY for better performance?

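I have a dataset in a CSV file with 2500 lines. The file is structured in this (simplified) way:

id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value

Each receptor in the file is already in a table and has a unique id.

I need to upload each line to a table with this format: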

id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3; 2.1
[...]
2500; 1; 2.4
2500; 2; 3.0
2500; 3; 1.1
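
The reshaping described above, from one wide CSV line per run to one row per (run, receptor) pair, can be sketched roughly as follows. This is only an illustration, not the code from the question: the function name `wide_to_long` and the two sample lines are hypothetical, and it assumes that the receptor values start at the third field and appear in the same order as the receptor ids.

```python
import csv
import io


def wide_to_long(csv_text, receptor_ids, delimiter=";"):
    """Yield (id_run, id_receptor, receptor_value) for each receptor column.

    Assumes each line is: id_run; run_name; value_1; ...; value_N, and that
    receptor_ids lists the database ids in the same order as the value columns.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for fields in reader:
        fields = [f.strip() for f in fields]
        id_run = int(fields[0])
        values = fields[2:]  # skip id_run and run_name
        for id_receptor, value in zip(receptor_ids, values):
            yield id_run, id_receptor, float(value)


# Hypothetical sample: two runs, three receptors.
sample = "1; run_a; 2.5; 3.2; 2.1\n2; run_b; 2.4; 3.0; 1.1\n"
rows = list(wide_to_long(sample, receptor_ids=[1, 2, 3]))
```

Because the function is a generator, the long rows never need to be held in a list unless the caller asks for one.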

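Currently, I write all the data that needs to be uploaded to a .txt file and use PostgreSQL's COPY command to transfer the file to the destination table.

For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates about 110000 records in the text file to be uploaded.

I drop the foreign keys of the destination table before the upload and restore them afterwards.

With this method, generating the text file currently takes about 8 seconds, and copying the file to the table takes 1 second.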
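
To check where the time goes, each phase can be timed separately. A minimal sketch, assuming the preparation step is wrapped in a function: `timed`, `prepare` and the dummy workload below are hypothetical stand-ins, not part of the original program.

```python
import time


def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("%s: %.3f s" % (label, elapsed))
    return result, elapsed


def prepare(n_runs, n_receptors):
    # Hypothetical stand-in for the step that builds the lines to upload.
    return ["%s %s %s\n" % (run, rec, 0.0)
            for run in range(n_runs) for rec in range(n_receptors)]


lines, prepare_seconds = timed("prepare", prepare, 2500, 50)
```

Timing the upload call the same way shows whether the preparation or the COPY itself dominates.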

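Is there a technique, library, or anything else I could use to speed up preparing the data for upload, so that 90% of the time is not spent writing the text file?

Edit:

Here is my (updated) code. I am now using a bulk write to the text file. It looks faster (110 000 lines uploaded in 3.8 seconds).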

# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
    # the run_specs variable consists of the attributes defining a run 
    # (id_run, run_name, etc.). So basically a line in the CSV file without the 
    # receptors data
    sc_uid = get_uid(db, table_name) # function to get the unique ID of the run
    for rec_i, rec in enumerate(rec_uids):
        # the rec_uids variable is the unique IDs in the database for the 
        # receptors in the CSV file
        line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
        lines.append(line_to_write)

# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
    tmp_data.writelines(lines)

# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results') # function to get foreign keys

# drop the foreign keys
for key in rr_fks:
    DB.drop_fk(conn, 'receptor_results', key[0]) # function to drop FKs

# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],
             " ", False)

# restore foreign keys
for key in rr_fks:
    DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])

# commit to database
conn.commit()

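Edit #2:

Using the cStringIO library, I replaced the creation of a temporary text file with a file-like object, but the speed gain is very, very small.

Code changes: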

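outf = cStringIO.StringIO()
for rec_i, rec in enumerate(rec_uids):
    outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))

outf.seek(0)  # rewind: copy_from reads from the current position
cur.copy_from(outf, 'receptor_results', sep=' ')  # fields are space-separated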

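Best answer

Yes, there is something you can do to speed up writing the data to a file ahead of time: don't bother!

The data already fits in memory, so that isn't a problem. So, instead of writing the lines to a list of strings, write them to a slightly different object: a StringIO instance. The data can then stay in memory and be passed as the argument to psycopg2's copy_from function.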

import StringIO  # Python 2 module; on Python 3 use io.StringIO instead

filelike = StringIO.StringIO('\n'.join(['1\tA', '2\tB', '3\tC']))
cursor.copy_from(filelike, 'your-table-name')

Note that the StringIO must contain the newlines, field separators, and so on, just as the file would.
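
On Python 3 the same idea would use `io.StringIO`, since the `StringIO` and `cStringIO` modules no longer exist there. The following is a sketch under assumptions rather than a drop-in implementation: `build_copy_buffer` and `copy_rows` are hypothetical helper names, the table and column names are taken from the question, and `cursor` is assumed to be a psycopg2 cursor. The buffer is rewound with `seek(0)` after writing, because it is read from its current position.

```python
import io


def build_copy_buffer(rows, sep="\t"):
    """Return an in-memory file with one sep-delimited line per row."""
    buf = io.StringIO()
    for row in rows:
        buf.write(sep.join(str(field) for field in row))
        buf.write("\n")
    buf.seek(0)  # rewind so the reader starts at the first line
    return buf


def copy_rows(cursor, rows, table="receptor_results",
              columns=("sc_uid", "rec_uid", "value")):
    """Stream rows to PostgreSQL with COPY; cursor is assumed to be psycopg2's."""
    buf = build_copy_buffer(rows)
    cursor.copy_from(buf, table, sep="\t", columns=columns)


# Building the buffer needs no database connection.
buf = build_copy_buffer([(1, 1, 2.5), (1, 2, 3.2)])
content = buf.getvalue()
```

With an open connection, this would be called as `copy_rows(conn.cursor(), rows)` followed by `conn.commit()`.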

Regarding python - Alternative to "write to file" for transferring CSV data to PostgreSQL with COPY for better performance?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37619208/
