python - Finding duplicates in a csv and the uniqueness of those duplicates

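I need to create a script that loads a csv (sometimes labeled .inf) into memory and evaluates the data for a particular type of duplicate. The csv itself will always have different information in each field, but the columns will be the same. There are roughly 100 columns. In my example I will pare it down to 10 columns for readability.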

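The "type" of duplicate I am looking for is a little odd. I first need to find all duplicates in column 2. Then I need to look at each set of duplicates and examine column 8 (in my actual csv it will be column 84).
Looking at column 8, I only need to output data that is:

A. Duplicated in column 2

B. Unique in column 8

Column 2 might have only 2 duplicates whose column 8 values are the same. I don't need to see that. If there are 3 duplicates in column 2, and 2 of their column 8 values are the same while 1 is unique, I need to see all 3 FULL rows.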

Desired input
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,837iec,john;doe,10/1/2019,ryzen,split,32929,12345,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

This data will change constantly, and even the number of characters in column 8 may vary.

Desired output
m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left

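You can see from my desired output that I don't need to see the rows with 837iec: although their column 2 values are duplicates, their column 8 values match each other. For something like 382ork, on the other hand, 2 of the column 8 values match and one is unique, so I need to see all 3 rows.

I will be using this on a unix system, and my desired usage is to type "./scriptname filename.csv"; the output can go to standard output or to a log file if needed.

I haven't been able to find a way to do this, because the way I need to compare column 8 confuses me. Any help would be greatly appreciated.

I found this in another thread, and it at least gets me the full rows of the column 2 duplicates. I don't think I fully understand how it works.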

#!/usr/bin/awk -f
{
    lines[$1][NR] = $0;
}
END {
    for (vehid in lines) {
        if (length(lines[vehid]) > 1) {
            for (lineno in lines[vehid]) {
                # Print duplicate line for decision purposes
                print lines[vehid][lineno];
                # Alternative: print line number and line
                #print lineno, lines[vehid][lineno];
            }
        }
    }
}

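My problem is that it doesn't take the next column into account. It also doesn't handle blank columns well. My csv will have roughly 100 columns, of which about 50 may be completely blank.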
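
On the blank columns: a comma-aware parser keeps empty fields in position, so column numbers stay stable no matter how many fields are blank. As a minimal illustration in Python (the sample row below is made up for this sketch and is not taken from the real file), the standard csv module returns an empty string for each blank column:

```python
import csv
import io

# Hypothetical row: ten columns, four of them empty.
line = "m,123veh,,10/1/2019,,,32929,38757ace,,left\n"

# csv.reader splits on commas and keeps empty fields as "".
row = next(csv.reader(io.StringIO(line)))

print(len(row))   # number of fields, empty ones included
print(row[1])     # column 2 (zero-based index 1)
print(row[7])     # column 8 (zero-based index 7)
```

Because the empty fields are preserved, column 84 of the real file would still be found at index 83 even when many of the columns before it are blank.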

Best answer

Could you please try the following.

awk '
BEGIN{
  FS=","
}
FNR==NR{
  a[$2]++
  b[$2,$8]++
  c[$2]=(c[$2]?c[$2] ORS:"")$0
  next
}
a[$2]>1 && b[$2,$8]==1{
  print c[$2]
  delete a[$2]
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)

With the samples you have shown, the output is as follows.

m,123veh,john;doe,10/1/2019,ryzen,split,32929,38757ace,turn,left
m,123veh,john;doe,10/1/2019,ryzen,split,32929,495842,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,38757,turn,left
m,382ork,john;doe,10/1/2019,ryzen,split,32929,4978d87,turn,left
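
Since the question asks for a script that can be run as ./scriptname filename.csv, the same logic can also be sketched in Python. This is only a minimal sketch and not part of the original answer: select_rows, KEY_COL, and CMP_COL are names chosen here for illustration, and it assumes every non-empty row has at least eight comma-separated fields.

```python
#!/usr/bin/env python3
"""Print duplicate column-2 groups that contain a one-off column-8 value."""
import csv
import sys
from collections import Counter, defaultdict

KEY_COL = 1   # column 2, zero-based (assumed layout from the question)
CMP_COL = 7   # column 8, zero-based (would be 83 for column 84)


def select_rows(rows, key_col=KEY_COL, cmp_col=CMP_COL):
    """Mirror the awk condition a[$2]>1 && b[$2,$8]==1 in a single pass."""
    key_count = Counter()        # plays the role of awk array a
    pair_count = Counter()       # plays the role of awk array b
    groups = defaultdict(list)   # plays the role of awk array c
    for row in rows:
        key = row[key_col]
        key_count[key] += 1
        pair_count[key, row[cmp_col]] += 1
        groups[key].append(row)

    selected = []
    for key, members in groups.items():
        # True when some column-8 value occurs exactly once in this group.
        has_one_off = any(pair_count[key, r[cmp_col]] == 1 for r in members)
        if key_count[key] > 1 and has_one_off:
            selected.extend(members)
    return selected


def main(path):
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]  # skip empty lines
    csv.writer(sys.stdout, lineterminator="\n").writerows(select_rows(rows))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Groups are printed in the order their column 2 value first appears in the file, so no sorting step is needed. Rows are re-emitted through csv.writer, so a field that was quoted in the input may be quoted slightly differently in the output.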

Explanation: a detailed, line-by-line explanation of the above code.

awk '                                                     ##Start the awk program.
BEGIN{                                                    ##Start the BEGIN section.
  FS=","                                                  ##Set the field separator to a comma.
}                                                         ##Close the BEGIN section.
FNR==NR{                                                  ##FNR==NR is TRUE only while the first copy of Input_file is being read.
  a[$2]++                                                 ##Array a, indexed by $2, counts how many lines share each 2nd field.
  b[$2,$8]++                                              ##Array b, indexed by $2,$8, counts how many lines share each pair of 2nd and 8th fields.
  c[$2]=(c[$2]?c[$2] ORS:"")$0                            ##Array c, indexed by $2, collects every whole line with that 2nd field, joined by ORS.
  next                                                    ##Skip all further statements for this line.
}                                                         ##Close the FNR==NR block.
a[$2]>1 && b[$2,$8]==1{                                   ##On the second read: the 2nd field occurs more than once AND this $2,$8 pair occurs exactly once.
  print c[$2]                                             ##Print all lines collected in c for this 2nd field.
  delete a[$2]                                            ##Delete a[$2] so the same group is not printed again.
}' <(sort -t',' -k2 Input_file) <(sort -t',' -k2 Input_file)  ##Feed the input twice, sorted from the 2nd field onward, so lines with equal 2nd fields come together.
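
One edge case worth noting: the condition b[$2,$8]==1 fires only when some column 8 value occurs exactly once within its column 2 group. A group whose column 8 values are, say, X, X, Y, Y differs in column 8 but would not be printed. The question does not say whether such groups should be reported. If they should, a sketch in Python could compare the set of distinct column 8 values instead (the function name and the sample rows below are hypothetical, and the column positions are assumed from the question):

```python
from collections import defaultdict


def groups_with_differing_values(rows, key_col=1, cmp_col=7):
    """Return rows whose key_col value repeats and whose cmp_col values are not all equal."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_col]].append(row)

    selected = []
    for members in groups.values():
        distinct = {row[cmp_col] for row in members}
        # Keep the group when it has duplicates and more than one distinct value.
        if len(members) > 1 and len(distinct) > 1:
            selected.extend(members)
    return selected


# Hypothetical rows: k1 has column 8 values X, X, Y, Y; k2 has Z, Z.
rows = [
    ["m", "k1", "", "", "", "", "", "X"],
    ["m", "k1", "", "", "", "", "", "X"],
    ["m", "k1", "", "", "", "", "", "Y"],
    ["m", "k1", "", "", "", "", "", "Y"],
    ["m", "k2", "", "", "", "", "", "Z"],
    ["m", "k2", "", "", "", "", "", "Z"],
]
result = groups_with_differing_values(rows)
print(len(result))  # 4: only the k1 rows are kept
```

On the sample data from the question both conditions select the same five rows; they differ only when every column 8 value in a group is repeated at least twice.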

Regarding "python - Finding duplicates in a csv and the uniqueness of those duplicates", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59164259/
