linux - Remove duplicates from INSANE BIG WORDLIST

What is the best way to do this? It is a 250GB text file with 1 word per line.

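Input:

Desired output:

I need to keep 1 copy of every duplicated line. If there are 2 identical lines, I don't want to delete both, only 1, so that 1 unique line always remains.

What I do now: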

$ cat final.txt | sort | uniq > finalnoduplicates.txt

I'm running this inside screen. Is it working? I can't tell, because when I check the size of the output file, it is 0:

123user@instance-1:~$ ls -l
total 243898460
-rw-rw-r-- 1 123user 249751990933 Sep  3 13:59 final.txt
-rw-rw-r-- 1 123user            0 Sep  3 14:26 finalnoduplicates.txt
123user@instance-1:~$

But when I check htop, the CPU usage of the screen session running this command is at 100%.

Am I doing something wrong?

Best answer

You can do this with sort alone; cat and uniq are not needed.

$ sort -u final.txt > finalnoduplicates.txt

You can simplify it further by letting sort write the output file itself:

$ sort -u final.txt -o finalnoduplicates.txt
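
As a quick sanity check, the three variants above can be compared on a tiny sample. This is only a sketch: the scratch directory and file names (sample.txt, pipeline.txt, redirect.txt, option.txt) are made up for illustration and stand in for the real 250GB file.

```shell
# Hypothetical sample data in a scratch directory (not the real final.txt).
tmpdir=$(mktemp -d)
printf '%s\n' banana apple banana cherry apple > "$tmpdir/sample.txt"

# The pipeline from the question.
cat "$tmpdir/sample.txt" | sort | uniq > "$tmpdir/pipeline.txt"

# sort alone, output written through shell redirection.
sort -u "$tmpdir/sample.txt" > "$tmpdir/redirect.txt"

# sort alone, output written through -o (option placed before the file name here).
sort -u -o "$tmpdir/option.txt" "$tmpdir/sample.txt"

# cmp -s exits with status 0 when two files are byte-identical.
cmp -s "$tmpdir/pipeline.txt" "$tmpdir/redirect.txt" &&
  cmp -s "$tmpdir/redirect.txt" "$tmpdir/option.txt" &&
  echo "all three outputs match"

cat "$tmpdir/option.txt"
```

Each distinct word appears once in the result, which is the behavior asked for in the question.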

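Finally, since your input file is purely numeric data, you can tell sort so with the -n switch, which may further improve the overall performance of this task. Note that with -n, the -u option treats lines with the same numeric value as duplicates, so use it only when every line really is a number: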

$ sort -nu final.txt -o finalnoduplicates.txt
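
The effect of combining -n with -u can be seen on a small made-up sample. The sketch below assumes nothing about the real file; numbers.txt and the word list are invented for illustration.

```shell
# Hypothetical sample: 10 and 010 have the same numeric value, 9 appears twice.
tmpdir=$(mktemp -d)
printf '%s\n' 10 9 010 9 100 > "$tmpdir/numbers.txt"

# Numeric comparison: 10 and 010 count as equal, so three lines remain.
numeric=$(sort -nu "$tmpdir/numbers.txt")

# String comparison: 10 and 010 are different lines, so four lines remain.
plain=$(sort -u "$tmpdir/numbers.txt")

# Non-numeric lines all compare as 0 under -n, so -nu keeps only one of them.
words=$(printf '%s\n' apple banana cherry | sort -nu)

echo "numeric lines: $(printf '%s\n' "$numeric" | wc -l)"
echo "plain lines:   $(printf '%s\n' "$plain" | wc -l)"
echo "word lines:    $(printf '%s\n' "$words" | wc -l)"
```

For a list of words rather than numbers, plain sort -u is therefore the safer choice.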

From the sort man page:

   -n, --numeric-sort
          compare according to string numerical value

   -u, --unique
          with -c, check for strict ordering; without -c, output only the
          first of an equal run

   -o, --output=FILE
          write result to FILE instead of standard output
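
On the empty output file in the question: sort has to read all of its input before it can emit the first line, so the output staying at 0 bytes while the CPU is busy is expected on a file this large. In the meantime sort spills sorted chunks to temporary files. The sketch below shows a few GNU sort options that are commonly adjusted for big inputs. It assumes GNU coreutils; the buffer size, thread count, and scratch directory are illustrative values, not tuned recommendations.

```shell
# Hypothetical small sample standing in for final.txt.
tmpdir=$(mktemp -d)
printf '%s\n' pear apple pear fig apple > "$tmpdir/sample.txt"

# Directory for sort's temporary files; for a 250GB input this needs
# to be on a disk with enough free space.
mkdir "$tmpdir/scratch"

# LC_ALL=C    compare raw bytes instead of applying locale collation rules
# -S 64M      size of the in-memory buffer (illustrative value)
# -T DIR      where temporary files are written
# --parallel  number of sorts run concurrently (illustrative value)
LC_ALL=C sort -u -S 64M -T "$tmpdir/scratch" --parallel=2 \
  -o "$tmpdir/unique.txt" "$tmpdir/sample.txt"

cat "$tmpdir/unique.txt"
```

Byte-wise comparison under LC_ALL=C changes the ordering for non-ASCII text, but the set of unique lines is what matters for deduplication.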

For linux - Remove duplicates from INSANE BIG WORDLIST, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52152703/
