I have a CSV with 13 billion rows, weighing in at 719GB. The CSV contains some duplicate rows. It has three columns; sample data follows:
tag,time,sensor_value
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
Each instance of the composite key (tag, time) should be unique. In other words, a given tag can have only one value at a given time.
I have tried the following:
awk -F, '!seen[$1,$2]++' data.csv > data_UNIQUE.csv
The kernel eventually killed the process above with an out-of-memory error. My system specs are:
Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
128GB RAM
2TB NVME
How can I process this CSV successfully with awk?
Edit: The desired output CSV should contain no duplicate rows, and per the discussion in the comments, it makes sense to sort before piping into awk so that only adjacent lines need to be compared.
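The sort-before-dedupe idea from the comments can also be sketched with sort alone (GNU sort assumed; the -S and -T values are tuning suggestions, not measured settings):

```shell
#!/usr/bin/env bash
# Sketch (GNU sort assumed): dedupe on the composite key (fields 1-2) using an
# external merge sort, which spills temp runs to disk instead of building a
# 719GB-scale hash table in RAM the way the awk one-liner does.
# -S caps the in-memory buffer; -T points temp files at the NVMe drive.
# Caveats: keeps an arbitrary row per key (not necessarily the first seen),
# and the header line gets sorted in with the data rather than staying first.
sort -t, -u -k1,2 -S 50% -T /tmp data.csv > data_UNIQUE.csv
```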
Desired output:
tag,time,sensor_value
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
Best answer
Using any versions of the mandatory Unix tools awk, sort, and cut, this will produce output sorted by the two key fields:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { FS=OFS="," }
{ print (NR>1), NR, $0 }
' "${@:--}" |
sort -t, -k1,1n -k3,4 -k2,2n |
cut -d, -f3- |
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $2 }
key != prev {
print
prev = key
}
'
$ ./tst.sh file
tag,time,sensor_value
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
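For the actual 719GB file, the heavy lifting happens in the sort stage. With GNU sort (assumed here), its memory buffer, temp directory, and thread count can all be tuned; the values below are illustrative guesses for the machine described, not measured settings:

```shell
# Hypothetical tuning of the sort stage only (GNU sort assumed; adjust values):
#   -S 100G          use up to ~100GB of the 128GB RAM before spilling to disk
#   -T /mnt/nvme/tmp put temp merge runs on the 2TB NVMe (path is a guess)
#   --parallel=8     sort with up to 8 concurrent threads
sort -t, -k1,1n -k3,4 -k2,2n -S 100G -T /mnt/nvme/tmp --parallel=8
```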
This version retains the input order in the output:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { FS=OFS="," }
{ print (NR>1), NR, $0 }
' "${@:--}" |
sort -t, -k3,4 |
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $3 FS $4 }
key != prev {
print
prev = key
}
' |
sort -t, -k1,1n -k2,2n |
cut -d, -f3-
$ ./tst.sh file
tag,time,sensor_value
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
We use awk to decorate the input (printing NR>1) so the header line (0) is kept separate from the rest (1), rather than using head -n 1 test.csv && tail -n +2 test.csv | sort..., since the latter would require opening the input file twice and so would not work if the input came from a pipe. We also decorate with NR so that, given two rows with duplicate keys, the value printed is the first one present in the input (or we could reverse the sort on that field to print the last one instead, if that were preferable). We could use GNU sort with -s (stable sort) instead of the NR decoration, but then the solution would become unnecessarily GNU-only.
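To make the decoration concrete, here is what the first awk stage emits for the sample file: the two prepended fields are the header flag and NR, which the later cut -d, -f3- strips back off:

```shell
# Show the first three decorated lines of the sample file.
awk 'BEGIN { FS=OFS="," } { print (NR>1), NR, $0 }' file | head -n 3
# 0,1,tag,time,sensor_value
# 1,2,"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
# 1,3,"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
```

Because the header's flag is 0 and every data row's flag is 1, sorting numerically on field 1 pins the header to the top, and sorting numerically on field 2 (NR) breaks ties between duplicate keys in favor of the earliest input row.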
Related question on Stack Overflow: csv - AWK - processing a large CSV (13 billion rows) for duplicates based on multiple columns (composite key) causes an out-of-memory error: https://stackoverflow.com/questions/68999095/