python - How to preserve null values when writing to csv

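I am using Python's csv module to write data from SQL Server to a CSV file, and then uploading that file to a Postgres database with the COPY command. The problem is that Python's csv writer automatically converts NULLs into the empty string "", which makes my job fail when the column is an int or float data type: it tries to insert "" where the value should be None or NULL.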

To make it as easy as possible to interface with modules which implement the DB API, the value None is written as the empty string.

https://docs.python.org/3.4/library/csv.html?highlight=csv#csv.writer

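What is the best way to preserve the null values? Is there a better way to write CSVs in Python? I am open to all suggestions.

Example:

I have lat and long values: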

42.313270000    -71.116240000
42.377010000    -71.064770000
NULL    NULL

When writing to csv, it converts the nulls to "":

with file_path.open(mode='w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)

42.313270000,-71.116240000
42.377010000,-71.064770000
"",""

NULL

Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.

https://www.postgresql.org/docs/9.2/sql-copy.html

Answer:

What fixed the problem for me was changing the quoting to csv.QUOTE_MINIMAL.

csv.QUOTE_MINIMAL Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.

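Best answer

You have two options here: change the quoting option of csv.writer() in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer).

Python `csv.writer()` and quoting

On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC: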

Instructs writer objects to quote all non-numeric fields.

None values are non-numeric, so they are written as "".

Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:

csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.

csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.

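Since all you are writing is longitude and latitude values, you don't need any quoting here; there are no delimiter or quote characters in your data.

With either option, the CSV output for None values is simply an empty, unquoted string: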

>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
...     outfile = StringIO()
...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
...     csv_writer.writerows(rows)
...     return outfile.getvalue()
...
>>> rows = [
...     [42.313270000, -71.116240000],
...     [42.377010000, -71.064770000],
...     [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""

>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,

>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
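If the export also carried text columns (an assumption; the question only shows coordinates), csv.QUOTE_MINIMAL would still quote just the fields that need it while leaving None unquoted. A minimal sketch, using a made-up name column and a hypothetical helper rows_to_csv():

```python
import csv
from io import StringIO


def rows_to_csv(rows):
    """Serialize rows with QUOTE_MINIMAL; None becomes an empty, unquoted field."""
    outfile = StringIO()
    writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return outfile.getvalue()


# Hypothetical rows: a text column followed by latitude and longitude.
rows = [
    ["Boston, MA", 42.31327, -71.11624],  # contains the delimiter, so it is quoted
    ["unknown", None, None],              # None is written as an empty field
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keep in mind that in a multi-column row an empty Python string '' is also written as an empty, unquoted field in this mode, so COPY in CSV format would load it as NULL too.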

PostgreSQL 9.4 `COPY FROM`, `NULL` values and `FORCE_NULL`

As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs by using the FORCE_NULL option. From the COPY FROM documentation:

FORCE_NULL

Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.

Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:

COPY position (
    lon, 
    lat
) 
FROM 'filename'
WITH (
    FORMAT csv,
    NULL '',
    DELIMITER ',',
    FORCE_NULL(lon, lat)
);

at which point it no longer matters what quoting option you used on the Python side.
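From Python, the same COPY could be issued over the client connection instead of reading a server-side file. The sketch below is only an illustration: copy_csv_with_force_null() is a hypothetical helper, and it assumes a cursor that offers psycopg2's copy_expert(sql, file) method. Table and column names are interpolated directly, so they must come from trusted code, never from user input.

```python
def copy_csv_with_force_null(cursor, table, columns, csv_file):
    """Stream csv_file into table, treating quoted empty strings as NULL.

    cursor is assumed to provide copy_expert(sql, file), as psycopg2
    cursors do. table and columns must be trusted identifiers.
    """
    column_list = ", ".join(columns)
    statement = (
        f"COPY {table} ({column_list}) FROM STDIN "
        f"WITH (FORMAT csv, NULL '', FORCE_NULL ({column_list}))"
    )
    # COPY ... FROM STDIN reads the data sent by the client.
    cursor.copy_expert(statement, csv_file)
    return statement


# Possible usage with psycopg2 (not executed here; connection details are assumptions):
#   with open("positions.csv", newline="") as f, conn.cursor() as cur:
#       copy_csv_with_force_null(cur, "position", ["lon", "lat"], f)
```

FROM STDIN is used here because a file path in COPY is resolved on the database server, which is often not where the Python job runs.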

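Other options to consider

For simple data transformation tasks from other databases, don't use Python

If you are already querying databases to collate data to go into PostgreSQL, consider inserting directly into Postgres. If the data comes from other sources, using a foreign data wrapper (fdw) module lets you cut out the middleman and pull data directly into PostgreSQL from those sources.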
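As a rough illustration of inserting directly, DB API parameter binding passes None through as NULL, so no CSV round trip is involved. The sketch below uses the standard library's sqlite3 purely as a stand-in so that it runs anywhere; a real job would use a SQL Server driver for the source and a PostgreSQL driver for the target, and the positions table is a made-up name.

```python
import sqlite3

# Rows as they might come back from the source cursor; NULL arrives as None.
source_rows = [
    (42.31327, -71.11624),
    (42.37701, -71.06477),
    (None, None),
]

# sqlite3 stands in for the target database in this sketch.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE positions (lat REAL, lon REAL)")
# Parameter binding sends None as NULL; no text conversion takes place.
target.executemany("INSERT INTO positions (lat, lon) VALUES (?, ?)", source_rows)
target.commit()

null_count = target.execute(
    "SELECT COUNT(*) FROM positions WHERE lat IS NULL AND lon IS NULL"
).fetchone()[0]
print(null_count)
```

The placeholder style differs between drivers (sqlite3 uses ?, psycopg2 uses %s), but the handling of None is the same idea.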

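Numpy data? Consider using COPY FROM as binary, directly from Python

Numpy data can be inserted more efficiently via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.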

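Persisting data to handle large datasets in a pipeline?

Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, includes the infrastructure to run data analysis steps in parallel, and lets you treat distributed, structured data as Pandas dataframes.

Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.

Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.

The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.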

For "python - How to preserve null values when writing to csv", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54816169/