linux - BASH: find files of the same size, compare them with cksum, delete duplicates but keep their names as symbolic links

Good day! I'm an intern who has just started learning bash scripting (thanks, bash.academy!), and I'm trying to write a script that does the following:

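Find files of the same size.
Use cksum to determine whether the file contents are duplicates.
Delete the duplicate files but keep both file names, turning each deleted file into a symbolic link to the remaining file.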

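Some background on what is going on: there is a program that generates these files, and it produces a lot of duplicates. It creates pairs of files that contain exactly the same data under different names, and we have programs that depend on the duplicate files (likewise, we cannot change those programs to look for the non-duplicate file), so I have to turn the deleted files into symbolic links. I'd appreciate any advice, cheers!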

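`#!/usr/bin/env bash
cd /path/to/files
ls -l -S | sort -k 5 -n   # list files ordered by size (column 5 of ls -l), smallest first
# Duplicate check. cksum prints "checksum size filename", so $2 is the size
# and $3 is the file name; arr is keyed on the size.
cksum /path/to/files/* |
  awk '{ if ($2 in arr)
           { print "duplicates ", $3, arr[$2], "duplicate filesize = ", $2 }
         else
           { arr[$2] = $3 } }'
`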

Best answer

The question is not entirely clear, but I hope the script below helps:

#!/bin/bash
# Remove duplicate files using an associative array keyed by MD5 hash
declare -A file_list # Note: -A declares an associative array (bash 4+)
# The associative array above has the following format
# (each key is the hash prefixed with the letter X):
# file_list=([md5-hash]=filename)
duplicate_remover()
{
 md5_data=( $(md5sum "$1") )
 # md5sum prints 'hash filename'; the unquoted expansion splits that output,
 # so md5_data[0] holds the hash. See Reference 1
 check_exist=${file_list["X${md5_data[0]}"]+exists}
 # The expansion above yields "exists" if the key is already present in 'file_list'
 # and an empty string otherwise. It uses shell parameter expansion. See Reference 2
 if [ "$check_exist" = "exists" ]
 then
   ln -fs "${file_list["X${md5_data[0]}"]}" "$1"
   # The step above turns the duplicate into a symbolic link to the first file seen.
   # Note: '-f' makes 'ln' replace the destination file, which already exists here
 else
   file_list+=(["X${md5_data[0]}"]="$1")
   # If the hash is not in the array yet, add the file using the [key]=value construct.
 fi
}
# The driver below uses 'find' to feed files into the 'duplicate_remover' function
find . -maxdepth 1 -type f -print0 | while read -r -d '' filename
do
   duplicate_remover "$filename"
done
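
The script above hashes every file with md5sum and skips the size comparison the question asked for. The following is a minimal sketch of that variant, not part of the original answer: it checksums only files whose size is shared with at least one other file, uses cksum as in the question, and confirms each match with `cmp` because cksum is a 32-bit CRC and can collide. The function name `link_duplicates` and its variables are hypothetical, and bash 4+ is assumed for associative arrays.

```shell
#!/usr/bin/env bash
# Sketch only: link_duplicates is a hypothetical helper, not from the answer above.
# Assumes bash 4+ (associative arrays) and all files in a single directory.

link_duplicates() {
  local dir=$1 file size sum key i
  local -A size_count=()   # size in bytes -> number of files with that size
  local -A first_seen=()   # "size:crc"    -> first file seen with that content
  local -a files=() sizes=()

  # Pass 1: record the size of every regular (non-symlink) file.
  for file in "$dir"/*; do
    [[ -f $file && ! -L $file ]] || continue
    size=$(wc -c < "$file")
    size=${size//[[:space:]]/}          # some wc implementations pad with spaces
    files+=("$file")
    sizes+=("$size")
    size_count[$size]=$(( ${size_count[$size]:-0} + 1 ))
  done

  # Pass 2: checksum only files whose size occurs more than once.
  for i in "${!files[@]}"; do
    file=${files[i]}
    size=${sizes[i]}
    [[ ${size_count[$size]} -gt 1 ]] || continue   # unique size: cannot be a duplicate
    read -r sum _ < <(cksum < "$file")             # cksum on stdin prints "crc size"
    key="$size:$sum"
    if [[ -n ${first_seen[$key]+set} ]]; then
      # Same size and CRC: confirm byte-for-byte before replacing the file.
      if cmp -s -- "${first_seen[$key]}" "$file"; then
        # Link by base name so the link resolves within the same directory.
        ln -sf -- "$(basename -- "${first_seen[$key]}")" "$file"
      fi
    else
      first_seen[$key]=$file
    fi
  done
}
```

It would be called as `link_duplicates /path/to/files`. Because the glob expands in sorted order, the alphabetically first file of each duplicate group is the one kept.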

References

1. See the md5sum manpage.
2. See the use of ${var+stuff} in shell parameter expansion.
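
To illustrate the second reference, here is a small sketch of how `${var+stuff}` behaves on associative array elements. The names `seen` and `key_state` are made up for the example; bash 4+ is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: 'seen' and 'key_state' are illustrative names. Assumes bash 4+.

key_state() {
  local -A seen=([empty]="" [full]="value")
  # ${seen[$1]+exists} expands to "exists" when the key is set, even if its
  # value is the empty string, and to nothing when the key is absent.
  if [[ -n ${seen[$1]+exists} ]]; then
    echo "set"
  else
    echo "unset"
  fi
}

key_state empty     # prints: set
key_state full      # prints: set
key_state missing   # prints: unset
```

The form without a colon tests only whether the key is set, which is why a key holding an empty value still counts as present.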

Notes

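I have assumed that all files are in the same directory; if they are not, remove -maxdepth 1 from the find command.
The file that find encounters first is kept; the remaining duplicates are turned into links.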
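
Two caveats follow from these notes. The script stores paths exactly as find prints them (for example `./a`), so once -maxdepth 1 is removed, a link created inside a subdirectory would be resolved relative to that subdirectory and could dangle. In addition, find does not guarantee any particular order, so which copy is kept may vary. The sketch below is one possible way around both, not the original author's method: it passes find an absolute start directory so that link targets are absolute, and sorts the file list. The name `dedupe_tree` is hypothetical, and bash 4+, md5sum and a `sort` that supports `-z` (as in GNU coreutils) are assumed.

```shell
#!/usr/bin/env bash
# Sketch only: dedupe_tree is a hypothetical name.
# Assumes bash 4+, md5sum, and a sort that supports -z (GNU coreutils).

dedupe_tree() {
  local root hash filename
  local -A kept=()    # md5 hash -> absolute path of the file that is kept
  root=$(cd -- "$1" && pwd) || return 1    # absolute form of the start directory

  # sort -z makes "first file wins" deterministic; process substitution keeps
  # the loop in the current shell rather than a pipeline subshell.
  while IFS= read -r -d '' filename; do
    read -r hash _ < <(md5sum < "$filename")   # md5sum on stdin prints "hash  -"
    [[ -n $hash ]] || continue                 # skip files that could not be hashed
    if [[ -n ${kept[$hash]+exists} ]]; then
      # Absolute target, so the link resolves from any subdirectory.
      ln -sf -- "${kept[$hash]}" "$filename"
    else
      kept[$hash]=$filename
    fi
  done < <(find "$root" -type f -print0 | sort -z)
}
```

The trade-off is that absolute links break if the whole tree is later moved; with GNU ln, `ln -sr` (relative links) would be an alternative.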

A similar question to "linux - BASH: find files of the same size, compare them with cksum, delete duplicates but keep their names as symbolic links" can be found on Stack Overflow: https://stackoverflow.com/questions/38060741/
