python - Comparing 2 huge csv files in Python

Tags: python

I have 2 csv files.

File 1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File 2:

EmployeeName,Age,Salary,Address
Karthick,10,10.245,"140,North Street,India"
Vivek,20,2000,"USA"
Vinoth,12,2548.245,"140,North Street,India"

I want to compare these two files and report the differences to another csv file. I used the Python code below (version 2.7).

#!/usr/bin/env python
import difflib
import csv

with open('./Input/file1', 'r' ) as t1:
    fileone = t1.readlines()
with open('./Input/file2', 'r' ) as t2:
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

    for line in fileone:
        if line not in filetwo:
            outFile.write(line)

When I run it, this is the output I get:

Actual output

Vivek,20,2000,"USA"

But my expected output is shown below, because the "Vinoth" record appears twice in file1 but only once in file2.

Expected output

Vinoth,12,2548.245,"140,North Street,India"
Vivek,20,2000,"USA"

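Questions

  1. Please tell me how to get the expected output.
  2. Also, how can I get the file name and line number of each differing record into the output file?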

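Best answer

The problem you ran into is that the in keyword only checks whether an item exists, not whether it exists twice. The Vinoth row occurs at least once in each file, so the test line not in filetwo is never true for it and the extra copy in file1 goes unreported. If you are willing to use an external package, you can do this quickly with pandas.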

import pandas as pd

df1 = pd.read_csv('Input/file1.csv')
df2 = pd.read_csv('Input/file2.csv')

# create a new column with the count of how many times the row exists
df1['count'] = 0
df2['count'] = 0
df1['count'] = df1.groupby(df1.columns.to_list()[:-1]).cumcount() + 1
df2['count'] = df2.groupby(df2.columns.to_list()[:-1]).cumcount() + 1

# merge the two data frames with an outer join, add an indicator variable
# to show where each row (including the count) exists.
df_all = df1.merge(df2, on=df1.columns.to_list(), how='outer', indicator='exists')
print(df_all)
# prints:
#   EmployeeName  Age    Salary                 Address  count      exists
# 0       Vinoth   12  2548.245  140,North Street,India      1        both
# 1       Vinoth   12  2548.245  140,North Street,India      2   left_only
# 2     Karthick   10    10.245  140,North Street,India      1        both
# 3        Vivek   20  2000.000                     USA      1  right_only

# clean up the exists column and export the rows that do not exist in both frames
df_all['exists'] = (df_all.exists.str.replace('left_only', 'file1')
                                 .str.replace('right_only', 'file2'))
df_all.query('exists != "both"').to_csv('update.csv', index=False)
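
The count column is what lets the merge tell repeated rows apart. As a minimal sketch, using a small made-up in-memory frame rather than the files above, this shows how cumcount() numbers each repeat within a group of identical rows:

```python
import pandas as pd

# Illustrative data only: two identical "Vinoth" rows and one "Karthick" row.
df = pd.DataFrame({
    "EmployeeName": ["Vinoth", "Vinoth", "Karthick"],
    "Age": [12, 12, 10],
})

# cumcount() numbers rows within each group starting at 0, in row order;
# adding 1 gives "first occurrence", "second occurrence", and so on.
df["count"] = df.groupby(["EmployeeName", "Age"]).cumcount() + 1

counts = df["count"].tolist()
print(counts)  # [1, 2, 1]
```

Because the second Vinoth row carries count 2, it only matches a row in the other frame if that frame also contains a second copy. Otherwise the outer join marks it left_only.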

Edit: non-pandas version

You can use the rows as keys and their counts as values, then compare how many times each row appears in each file.

from collections import defaultdict

c1 = defaultdict(int)
c2 = defaultdict(int)

with open('./Input/file1', 'r' ) as t1:
    for line in t1:
        c1[line.strip()] += 1

with open('./Input/file2', 'r' ) as t2:
    for line in t2:
        c2[line.strip()] += 1

# create a set of all rows
all_keys = set()
all_keys.update(c1)
all_keys.update(c2)

# find the difference in the number of instances of the row
out = []
for k in all_keys:
    diff = c1[k] - c2[k]
    if diff == 0:
        continue
    if diff > 0:
        out.extend([k + ',file1'] * diff) # add which file it came from
    if diff < 0:
        out.extend([k + ',file2'] * abs(diff)) # add which file it came from

with open('update.csv', 'w') as outFile:
    outFile.write('\n'.join(out))
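
Neither version above records line numbers, which the second question asks for. One possible approach, sketched here under the assumptions of Python 3 and a header row in each file, is to store the line numbers of every occurrence instead of a bare count. The helper names index_rows and diff_with_positions are hypothetical, and treating the later occurrences as the unmatched ones is a choice made for this sketch:

```python
import csv
from collections import defaultdict


def index_rows(path):
    """Map each data row (as a tuple) to the line numbers where it occurs."""
    positions = defaultdict(list)
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed present)
        for row in reader:
            # reader.line_num is the number of source lines read so far,
            # i.e. the line on which the row just returned ends.
            positions[tuple(row)].append(reader.line_num)
    return positions


def diff_with_positions(path1, path2):
    """Return (file_label, line_number, row) for each unmatched occurrence."""
    rows1 = index_rows(path1)
    rows2 = index_rows(path2)
    report = []
    for row in sorted(set(rows1) | set(rows2)):
        lines1 = rows1.get(row, [])
        lines2 = rows2.get(row, [])
        matched = min(len(lines1), len(lines2))
        # Occurrences beyond the matched count have no partner in the other file.
        report.extend(("file1", n, row) for n in lines1[matched:])
        report.extend(("file2", n, row) for n in lines2[matched:])
    return report
```

With the sample files from the question, diff_with_positions('./Input/file1', './Input/file2') would flag line 3 of file1 (the second Vinoth row) and line 3 of file2 (the Vivek row). Each tuple can then be written to update.csv with csv.writer. Note that this keeps every distinct row of both files in memory, just as the counting version does.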

Regarding "python - Comparing 2 huge csv files in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59785589/
