I have 2 csv files.
File 1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File 2:
EmployeeName,Age,Salary,Address
Karthick,10,10.245,"140,North Street,India"
Vivek,20,2000,"USA"
Vinoth,12,2548.245,"140,North Street,India"
I want to compare the two files and report the differences into another csv file. I used the Python code below (version 2.7).
#!/usr/bin/env python
import difflib
import csv

with open('./Input/file1', 'r') as t1:
    fileone = t1.readlines()
with open('./Input/file2', 'r') as t2:
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
When I run it, this is the output I get:
Actual output
Vivek,20,2000,"USA"
But my expected output is below, because the record for "Vinoth" appears 2 times in file1 but only 1 time in file2.
Expected output
Vinoth,12,2548.245,"140,North Street,India"
Vivek,20,2000,"USA"
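Questions
- Please tell me how to get the expected output.
- Also, how can I get the file name and line number of each differing record into the output file?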
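Best answer
The problem you are running into is that the 'in' keyword only checks whether an item is present, not whether it is present twice. If you are willing to use an external package, you can do this quickly with pandas.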
import pandas as pd
df1 = pd.read_csv('Input/file1.csv')
df2 = pd.read_csv('Input/file2.csv')
# create a new column with the count of how many times the row exists
df1['count'] = 0
df2['count'] = 0
df1['count'] = df1.groupby(df1.columns.to_list()[:-1]).cumcount() + 1
df2['count'] = df2.groupby(df2.columns.to_list()[:-1]).cumcount() + 1
# merge the two data frames with an outer join, add an indicator variable
# to show where each row (including the count) exists.
df_all = df1.merge(df2, on=df1.columns.to_list(), how='outer', indicator='exists')
print(df_all)
# prints:
#   EmployeeName  Age    Salary                 Address  count      exists
# 0       Vinoth   12  2548.245  140,North Street,India      1        both
# 1       Vinoth   12  2548.245  140,North Street,India      2   left_only
# 2     Karthick   10    10.245  140,North Street,India      1        both
# 3        Vivek   20  2000.000                     USA      1  right_only
# clean up the exists column and export the rows that do not exist in both frames
df_all['exists'] = (df_all.exists.str.replace('left_only', 'file1')
                                 .str.replace('right_only', 'file2'))
df_all.query('exists != "both"').to_csv('update.csv', index=False)
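Edit: non-pandas version
You can count occurrences with each row as the key and its count as the value, then compare the counts of identical rows between the two files.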
from collections import defaultdict

c1 = defaultdict(int)
c2 = defaultdict(int)

with open('./Input/file1', 'r') as t1:
    for line in t1:
        c1[line.strip()] += 1

with open('./Input/file2', 'r') as t2:
    for line in t2:
        c2[line.strip()] += 1

# create a set of all rows
all_keys = set()
all_keys.update(c1)
all_keys.update(c2)

# find the difference in the number of instances of the row
out = []
for k in all_keys:
    diff = c1[k] - c2[k]
    if diff == 0:
        continue
    if diff > 0:
        out.extend([k + ',file1'] * diff)  # add which file it came from
    if diff < 0:
        out.extend([k + ',file2'] * abs(diff))  # add which file it came from

with open('update.csv', 'w') as outFile:
    outFile.write('\n'.join(out))
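The counting version above records which file a surplus row came from, but not its line number, which the second question asks for. One possible way to add that is sketched below: instead of a count, keep the list of line numbers for each row, and treat the occurrences beyond the shared count as the unmatched ones. This is only an illustrative sketch. It assumes Python 3, assumes each record sits on a single physical line, and the helper name diff_with_locations and the in-memory sample lists are hypothetical stand-ins for the real files.

```python
import csv
import io


def diff_with_locations(name1, lines1, name2, lines2):
    """Return sorted (file name, line number, row) tuples for unmatched rows."""
    def index_rows(lines):
        # Map each parsed row to the 1-based line numbers where it occurs.
        positions = {}
        for line_no, row in enumerate(csv.reader(lines), start=1):
            positions.setdefault(tuple(row), []).append(line_no)
        return positions

    pos1, pos2 = index_rows(lines1), index_rows(lines2)
    report = []
    for row in set(pos1) | set(pos2):
        in1, in2 = pos1.get(row, []), pos2.get(row, [])
        common = min(len(in1), len(in2))  # occurrences matched in both files
        # Anything beyond the matched occurrences is a difference.
        report.extend((name1, n, row) for n in in1[common:])
        report.extend((name2, n, row) for n in in2[common:])
    return sorted(report)


# Sample data copied from the question (stand-ins for the real input files).
FILE1 = [
    'EmployeeName,Age,Salary,Address',
    'Vinoth,12,2548.245,"140,North Street,India"',
    'Vinoth,12,2548.245,"140,North Street,India"',
    'Karthick,10,10.245,"140,North Street,India"',
]
FILE2 = [
    'EmployeeName,Age,Salary,Address',
    'Karthick,10,10.245,"140,North Street,India"',
    'Vivek,20,2000,"USA"',
    'Vinoth,12,2548.245,"140,North Street,India"',
]

report = diff_with_locations('file1', FILE1, 'file2', FILE2)

# io.StringIO stands in for open('update.csv', 'w', newline='').
out = io.StringIO()
writer = csv.writer(out)
for name, line_no, row in report:
    writer.writerow([name, line_no, *row])
print(out.getvalue())
```

With real files, open file objects can be passed in place of the lists, since csv.reader accepts any iterable of lines. Note that when a row is duplicated, this sketch reports the later occurrences as the unmatched ones; which duplicate counts as "extra" is an arbitrary choice.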
Regarding "python - Comparing 2 huge csv files in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59785589/