linux - How to find the two files with the most differences in a series of files

I have a series of .csv files containing space-separated columnar data (5 columns). The file names have the format "yyyymmdd.csv". Example file contents:

Contents of 20161201.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 20000 some value
123458 30000 some value

Contents of 20161202.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 80000 some value
123458 30000 some value

Contents of 20161203.csv

key value more columns (this line (header) is absent)
123456 50000 some value
123457 70000 some value
123458 30000 some value

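I want to compare the file dated "D" with the file dated "D+1", based on the value column. I am then interested in the two consecutive files that differ in the largest number of rows. So here, if I compare 20161201.csv with 20161202.csv, only the second row does not match

(123457 20000 some value versus 123457 80000 some value, because 20000 != 80000).

If I then compare 20161202.csv with 20161203.csv, 2 rows do not match (the first and the second).

So 20161202.csv and 20161203.csv are my target files.

I am looking for a sequence of bash commands that can do this.

PS: The number of lines in each file is large (around 3000), and you can assume all files share the same year and month (fewer than 30 files).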

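Best answer

If you do not need to check that the file names obey the date rule (a file compared against its date+1 file), you can do it like this: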

files=()
while IFS= read -r -d '' fn;do files+=("$fn");done < <(find . -name '201612*.csv' -print0 | sort -z)
#Load all filenames in an array. Using null separation we ensure that filenames will be
#handled correctly no matter if they do contain spaces or other special chars.
#find returns names in no particular order, so sort -z (GNU sort) puts them in date order.

max=0
for ((i=0;i<"${#files[@]}"-1;i++));do #iterate through the filenames array
  a="${files[i]}";b="${files[i+1]}" #compare file1 with file2, file2 with file3, etc - in series
  differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)
  echo "comparing $a vs $b - non matching lines=$differences" #Just for testing - can be removed .
  [[ "$max" -lt "$differences" ]] && max="$differences" && ahold="$a" && bhold="$b" #When we have the max differences we keep the names of the files
done

echo "max differences found=$max between $ahold and $bhold" #reporting max differences and in which files found
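The loop above compares neighbouring array entries without checking that they really are dated D and D+1. A minimal sketch of such a check follows; `is_next_day` is a hypothetical helper name, and the integer comparison relies on the question's assumption that all files fall in the same year and month (crossing a month boundary would need real date arithmetic, for example GNU `date -d`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if file $2 is dated one day after file $1.
# Assumes names like yyyymmdd.csv within a single month, as the question states.
is_next_day() {
  local a=${1##*/} b=${2##*/}   # strip any leading directory
  a=${a%.csv}; b=${b%.csv}      # strip the extension, leaving yyyymmdd
  (( 10#$a + 1 == 10#$b ))      # force base 10 and compare as integers
}

is_next_day ./20161201.csv ./20161202.csv && echo "consecutive"
is_next_day ./20161202.csv ./20161204.csv || echo "gap, skip this pair"
```

Inside the loop, something like `is_next_day "$a" "$b" || continue` would skip pairs that are separated by a missing day.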

The core of finding the non-matching lines between two files is grep. You can try grep by hand to check that the results are correct:

grep -v -F -w -f <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2)

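grep options:
-v : return the lines that do not match (inverts grep)
-F : fixed-string matching, not regular expressions
-w : whole-word matching, so that 5000 does not match 50000
-f : load the patterns from a file, here field 2 of file1. With these patterns we grep (search) field 2 of file2.
wc -l : counts the lines grep prints, i.e. the non-matching lines
<(cut -d' ' -f2 file2) : we grep only field 2 of file2 rather than the whole of file2, so that a file1/field 2 value cannot match in some other column of file2.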
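To try the grep command by hand, the sketch below recreates the question's three sample files in a scratch directory (the `mktemp -d` location and the variable names `d12` and `d23` are just for illustration) and counts the non-matching values for each consecutive pair.

```bash
#!/usr/bin/env bash
# Recreate the question's sample files in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$dir/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$dir/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$dir/20161203.csv"

# Count the column-2 values of the later file that do not occur in the earlier one.
d12=$(grep -v -Fw -f <(cut -d' ' -f2 "$dir/20161201.csv") <(cut -d' ' -f2 "$dir/20161202.csv") | wc -l)
d23=$(grep -v -Fw -f <(cut -d' ' -f2 "$dir/20161202.csv") <(cut -d' ' -f2 "$dir/20161203.csv") | wc -l)

echo "20161201 vs 20161202: $d12"
echo "20161202 vs 20161203: $d23"
rm -r "$dir"
```

On the sample data this reports 1 for the first pair and 2 for the second, matching the counts given in the question.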

Alternative solution with awk

Instead of grep, you can use awk like this:

awk 'NR==FNR{a[$2];next}!($2 in a)' file1 file2

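This selects the same lines as grep -v (awk prints the whole line of file2 rather than just field 2):

file1/field 2 ($2) is loaded into the array a.
The lines of file2 whose field 2 ($2) is not in this array are printed (the non-matching fields).

You can also pipe this to |wc -l to count the non-matching lines, just as with grep.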

So if you prefer to use awk, this line:

differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)

must be changed to:

differences=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$a" "$b" |wc -l)
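Both the grep and the awk command test whether a value of the second file occurs anywhere in column 2 of the first file, so they compare sets of values rather than rows with the same key. If the comparison should be per key, as the question's example suggests, a variation along these lines might be closer. It is only a sketch: `count_changed` is a hypothetical helper, and it assumes column 1 is a unique key in each file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count keys (column 1) present in both files whose
# column-2 value differs between them.
count_changed() {
  awk 'NR==FNR { v[$1] = $2; next }      # first file: remember value per key
       ($1 in v) && v[$1] != $2 { n++ }  # second file: same key, other value
       END { print n + 0 }' "$1" "$2"
}

dir=$(mktemp -d)
printf '%s\n' 'k1 100 x' 'k2 200 x' > "$dir/a.csv"
printf '%s\n' 'k1 200 x' 'k2 100 x' > "$dir/b.csv"   # values swapped between keys

set_based=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$dir/a.csv" "$dir/b.csv" | wc -l)
key_based=$(count_changed "$dir/a.csv" "$dir/b.csv")
echo "set-based: $set_based, key-based: $key_based"
rm -r "$dir"
```

With the two values swapped between keys, the set-based test finds no difference while the key-based count reports both rows as changed.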

Either way, it seems you need an array to hold the file names, and then a loop that walks through the files and compares them in pairs.
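As one possible way to put the array and the loop together, the sketch below fills the array from a glob instead of find (bash sorts glob results by name, which for yyyymmdd names is date order) and wraps the loop in a function. `max_diff_pair` is a hypothetical name, and the function assumes the files sit in a single directory and match `201612*.csv`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<count> <fileA> <fileB>" for the adjacent pair
# of 201612*.csv files in directory $1 with the most differing values.
max_diff_pair() {
  local dir=$1 max=-1 best_a='' best_b='' n i
  local files=("$dir"/201612*.csv)   # glob results come back sorted by name
  for ((i = 0; i < ${#files[@]} - 1; i++)); do
    # Same membership test as the awk one-liner, counted inside awk.
    n=$(awk 'NR==FNR { a[$2]; next } !($2 in a) { n++ } END { print n + 0 }' \
          "${files[i]}" "${files[i+1]}")
    if (( n > max )); then
      max=$n; best_a=${files[i]}; best_b=${files[i+1]}
    fi
  done
  echo "$max ${best_a##*/} ${best_b##*/}"
}

# Try it on the three sample files from the question.
dir=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$dir/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$dir/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$dir/20161203.csv"
result=$(max_diff_pair "$dir")
echo "$result"
rm -r "$dir"
```

For the sample files this prints `2 20161202.csv 20161203.csv`, the pair the question identifies as the target.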

A similar question on this topic was found on Stack Overflow: https://stackoverflow.com/questions/42415826/
