csv - AWK - Processing a large CSV (13 billion rows) for duplicate data based on multiple columns (composite key) results in an out-of-memory error

Tags: csv awk

I have a CSV with 13 billion rows, 719GB in size. Some of the rows in the CSV are duplicates. The CSV has three columns; sample data is shown below:

tag,time,sensor_value
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"

Each instance of the composite key (tag, time) should be unique. In other words, a tag can have only one value at a given time.

I have tried the following:

awk -F, '!seen[$1,$2]++' data.csv > data_UNIQUE.csv

The kernel eventually killed the above process because of an out-of-memory error. My system specs are:

Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
128GB RAM
2TB NVME

How can I process this CSV successfully with awk?

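Edit: The desired output CSV contains no duplicate data. Based on the discussion in the comments, it makes sense to sort before piping to awk, so that only adjacent rows need to be compared.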

Desired output:

tag,time,sensor_value
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"

Best answer

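Using any versions of the mandatory Unix tools awk, sort, and cut, this stores almost nothing in memory in awk (only the previous key) and leaves the heavy lifting to sort, which spills to temporary files when the data does not fit in RAM. This version sorts the output by the 2 key values: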

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN { FS=OFS="," }
    { print (NR>1), NR, $0 }
' "${@:--}" |
sort -t, -k1,1n -k3,4 -k2,2n |
cut -d, -f3- |
awk '
    BEGIN { FS=OFS="," }
    { key = $1 FS $2 }
    key != prev {
        print
        prev = key
    }
'

$ ./tst.sh file
tag,time,sensor_value
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
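
For an input of this size, how sort is configured can matter as much as the pipeline itself. The following is a minimal sketch, not part of the original answer, of the same pipeline with sort's buffer size and temporary directory set explicitly. It assumes a sort that supports -S and -T (GNU sort does); the names SORT_MEM, SORT_TMP, big_sort, and dedupe_big are hypothetical and introduced only for illustration. The temporary directory needs roughly as much free space as the decorated input, which is somewhat larger than the original file.

```shell
#!/usr/bin/env bash
# Sketch only. SORT_MEM and SORT_TMP are hypothetical settings, not from the answer.
SORT_MEM=${SORT_MEM:-50%}     # in-memory sort buffer (-S)
SORT_TMP=${SORT_TMP:-/tmp}    # where sort writes its temporary files (-T)

big_sort() {
    # LC_ALL=C makes sort compare raw bytes, which is typically faster
    # than locale-aware collation and is sufficient for grouping equal keys.
    LC_ALL=C sort -S "$SORT_MEM" -T "$SORT_TMP" "$@"
}

dedupe_big() {
    # Same steps as the script above: decorate, sort, undecorate, keep first per key.
    awk 'BEGIN { FS=OFS="," } { print (NR>1), NR, $0 }' |
    big_sort -t, -k1,1n -k3,4 -k2,2n |
    cut -d, -f3- |
    awk 'BEGIN { FS=OFS="," } { key = $1 FS $2 } key != prev { print; prev = key }'
}
```

It could be run as, for example, `SORT_TMP=/path/to/scratch dedupe_big < data.csv > data_UNIQUE.csv`, where the scratch path is a placeholder for a directory on the NVMe drive.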

And this one preserves the input order in the output:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN { FS=OFS="," }
    { print (NR>1), NR, $0 }
' "${@:--}" |
sort -t, -k3,4 -k2,2n |
awk '
    BEGIN { FS=OFS="," }
    { key = $1 FS $3 FS $4 }
    key != prev {
        print
        prev = key
    }
' |
sort -t, -k1,1n -k2,2n |
cut -d, -f3-

$ ./tst.sh file
tag,time,sensor_value
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"

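We decorate the input with awk (printing NR>1) to separate the header line (0) from the rest (1), rather than using head -n 1 test.csv && tail -n +2 test.csv | sort... . The latter requires opening the input file twice, so it would not work if the input came from a pipe.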
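
To make the decoration step concrete, here is a small sketch of what the first awk stage emits for a three-line input. The function name decorate and the sample rows are made up for illustration.

```shell
# decorate: prefix each line with a header flag (0 for line 1, 1 otherwise)
# and its input line number, so the header sorts first and input order is recoverable.
decorate() {
    awk 'BEGIN { FS=OFS="," } { print (NR>1), NR, $0 }'
}

printf '%s\n' 'tag,time,sensor_value' '"B","t2","1"' '"A","t1","2"' | decorate
# Output:
# 0,1,tag,time,sensor_value
# 1,2,"B","t2","1"
# 1,3,"A","t1","2"
```

Because the flag is field 1 and the line number is field 2, the original columns shift to fields 3 onward, which is why the sort keys in the scripts are -k3,4 and why cut -d, -f3- restores the original rows.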

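We also decorate with NR so that, given 2 rows with duplicate keys, the row printed is the one that appeared first in the input (or we could reverse the sort on that field so that the last one is printed, if that is preferable). We could use GNU sort's -s (stable sort) instead, but then the solution would be unnecessarily GNU-only.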
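
The first-versus-last choice can be sketched as follows, using two rows that share a key but carry different values. The helper name dedupe_keep and the sample data are hypothetical; the only change between the two modes is whether the NR field is sorted ascending (-k2,2n) or descending (-k2,2nr).

```shell
# dedupe_keep first|last: choose which row survives among rows sharing a (tag, time) key.
dedupe_keep() {
    nr_key='-k2,2n'                 # ascending NR: first occurrence wins
    if [ "$1" = last ]; then
        nr_key='-k2,2nr'            # descending NR: last occurrence wins
    fi
    awk 'BEGIN { FS=OFS="," } { print (NR>1), NR, $0 }' |
    sort -t, -k1,1n -k3,4 "$nr_key" |
    cut -d, -f3- |
    awk 'BEGIN { FS=OFS="," } { key = $1 FS $2 } key != prev { print; prev = key }'
}

sample='tag,time,sensor_value
"A","t1","1"
"A","t1","2"'

printf '%s\n' "$sample" | dedupe_keep first   # keeps the row with value "1"
printf '%s\n' "$sample" | dedupe_keep last    # keeps the row with value "2"
```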

Source: this question on Stack Overflow: https://stackoverflow.com/questions/68999095/
