python - Comparing 2 huge csv files in Python

I have 2 csv files.

File 1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File 2:

EmployeeName,Age,Salary,Address
Karthick,10,10.245,"140,North Street,India"
Vivek,20,2000,"USA"
Vinoth,12,2548.245,"140,North Street,India"

I want to compare these two files and report the differences in another csv file. I used the Python code below (version 2.7).

#!/usr/bin/env python
import difflib
import csv

with open('./Input/file1', 'r' ) as t1:
    fileone = t1.readlines()
with open('./Input/file2', 'r' ) as t2:
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

    for line in fileone:
        if line not in filetwo:
            outFile.write(line)

When I run it, this is the output I get:

Actual output

Vivek,20,2000,"USA"

However, my expected output is shown below, because the "Vinoth" record appears twice in file1 but only once in file2.

Expected output

Vinoth,12,2548.245,"140,North Street,India"
Vivek,20,2000,"USA"

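Question

Please tell me how to get the expected output.
Also, how can I write the file name and line number of each differing record to the output file?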

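Best answer

The problem you are running into is that the in keyword only checks whether an item is present, not whether it is present twice. If you are willing to use an external package, you can do this quickly with pandas.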

import pandas as pd

df1 = pd.read_csv('Input/file1.csv')
df2 = pd.read_csv('Input/file2.csv')

# create a new column with the count of how many times the row exists
df1['count'] = 0
df2['count'] = 0
df1['count'] = df1.groupby(df1.columns.to_list()[:-1]).cumcount() + 1
df2['count'] = df2.groupby(df2.columns.to_list()[:-1]).cumcount() + 1

# merge the two data frames with an outer join, add an indicator variable
# to show where each row (including the count) exists.
df_all = df1.merge(df2, on=df1.columns.to_list(), how='outer', indicator='exists')
print(df_all)
# prints:
#   EmployeeName  Age    Salary                 Address  count      exists
# 0       Vinoth   12  2548.245  140,North Street,India      1        both
# 1       Vinoth   12  2548.245  140,North Street,India      2   left_only
# 2     Karthick   10    10.245  140,North Street,India      1        both
# 3        Vivek   20  2000.000                     USA      1  right_only

# clean up the exists column and export the rows that do not exist in both frames
df_all['exists'] = (df_all.exists.str.replace('left_only', 'file1')
                                 .str.replace('right_only', 'file2'))
df_all.query('exists != "both"').to_csv('update.csv', index=False)
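
The membership issue described at the top of this answer can be seen in isolation with a few stand-in values. This is only an illustrative sketch: the short lists below are made-up placeholders for the full CSV lines, and collections.Counter is used here as one possible way to count duplicates, not something the original code relied on.

```python
from collections import Counter

# Stand-ins for the data lines of file1 and file2 (names only, for brevity).
file1_rows = ['Vinoth', 'Vinoth', 'Karthick']
file2_rows = ['Karthick', 'Vivek', 'Vinoth']

# Membership test: every file1 row is found in file2 at least once,
# so the second 'Vinoth' is never reported.
missed_by_in = [row for row in file1_rows if row not in file2_rows]

# Counter subtraction keeps only positive counts, so the surplus
# occurrence of 'Vinoth' in file1 survives.
only_in_file1 = Counter(file1_rows) - Counter(file2_rows)
only_in_file2 = Counter(file2_rows) - Counter(file1_rows)

print(missed_by_in)
print(only_in_file1)
print(only_in_file2)
```

Counting occurrences rather than testing presence is the same idea the count column implements in the pandas code above.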

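Edit: non-pandas version

You can use each row as a key and its count as the value, then compare the counts of identical rows between the two files.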

from collections import defaultdict

c1 = defaultdict(int)
c2 = defaultdict(int)

with open('./Input/file1', 'r' ) as t1:
    for line in t1:
        c1[line.strip()] += 1

with open('./Input/file2', 'r' ) as t2:
    for line in t2:
        c2[line.strip()] += 1

# create a set of all rows
all_keys = set()
all_keys.update(c1)
all_keys.update(c2)

# find the difference in the number of instances of the row
out = []
for k in all_keys:
    diff = c1[k] - c2[k]
    if diff == 0:
        continue
    if diff > 0:
        out.extend([k + ',file1'] * diff) # add which file it came from
    if diff < 0:
        out.extend([k + ',file2'] * abs(diff)) # add which file it came from

with open('update.csv', 'w') as outFile:
    outFile.write('\n'.join(out))
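
The question also asks for the line number of each differing record, which the code above does not record (it only appends the file name). The following is a minimal sketch of one way to keep line numbers, assuming Python 3 and the standard csv module so that quoted commas in the Address field stay intact. The names index_rows and diff_rows and the inline sample data are hypothetical and introduced here only for illustration. It also assumes that, when a row is duplicated, the later occurrences are the ones to report.

```python
import csv
import io
from collections import defaultdict


def index_rows(handle):
    """Map each parsed CSV row to the list of line numbers where it appears."""
    positions = defaultdict(list)
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    for row in reader:
        # reader.line_num is the number of source lines read so far
        positions[tuple(row)].append(reader.line_num)
    return positions


def diff_rows(handle1, handle2, name1='file1', name2='file2'):
    """Return (row, file name, line number) for rows with no partner in the other file."""
    pos1, pos2 = index_rows(handle1), index_rows(handle2)
    report = []
    for row in set(pos1) | set(pos2):
        lines1, lines2 = pos1.get(row, []), pos2.get(row, [])
        shared = min(len(lines1), len(lines2))
        # occurrences beyond the shared count are unmatched
        report.extend((row, name1, n) for n in lines1[shared:])
        report.extend((row, name2, n) for n in lines2[shared:])
    return sorted(report, key=lambda item: (item[1], item[2]))


# Sample data copied from the question, held in memory for the demo.
FILE1 = '''EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
'''
FILE2 = '''EmployeeName,Age,Salary,Address
Karthick,10,10.245,"140,North Street,India"
Vivek,20,2000,"USA"
Vinoth,12,2548.245,"140,North Street,India"
'''

report = diff_rows(io.StringIO(FILE1), io.StringIO(FILE2))
for row, name, line_no in report:
    print(name, line_no, row)
```

With real files, the two io.StringIO objects would be replaced by handles from open(path, newline=''), and each report entry could be written out with csv.writer instead of printed.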

Regarding python - Comparing 2 huge csv files in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59785589/
