python - I wrote a Python script to dedupe a CSV and I think it is 90% working. Could really use some help troubleshooting one issue

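The code is supposed to find duplicates by comparing first name, last name, and email. All duplicates should be written to a Dupes.csv file and all unique rows to Deduplicated.csv, but that is not what currently happens.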

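Example:

If row A appears 10 times in Original.csv, the code writes A1 to deduplicated.csv and A2 - A10 to dupes.csv.

This is incorrect. A1-A10 should ALL be written to dupes.csv, leaving only unique rows in deduplicated.csv.

Another strange behavior is that A2-A10 all get written to dupes.csv twice!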

I would really appreciate any and all feedback, as this is my first professional Python script and I am feeling pretty frustrated.

Here is my code:

import csv

def read_csv(filename):
    the_file = open(filename, 'r', encoding='latin1')
    the_reader = csv.reader(the_file, dialect='excel')
    table = []
    #As long as the table row has values we will add it to the table
    for row in the_reader:
        if len(row) > 0:
            table.append(tuple(row))
    the_file.close()
    return table


def create_file(table, filename):
    join_file = open(filename, 'w+', encoding='latin1')
    for row in table:
        line = ""
        #build up the new row - don't comma on last item so add last item separate
        for i in range(len(row)-1):
            line += row[i] + ","
        line += row[-1]
        #adds the string to the new file
        join_file.write(line+'\n')
    join_file.close()


def main():
    original = read_csv('Contact.csv')

    print('finished read')
    #hold duplicate values
    dupes = []
    #holds all of the values without duplicates
    dedup = set()
    #pairs to know if we have seen a match before
    pairs = set()
    for row in original:
        #if row in dupes:
            #dupes.append(row)
        if (row[4],row[5],row[19]) in pairs:
            dupes.append(row)
        else:
            pairs.add((row[4],row[5],row[19]))
            dedup.add(row)

    print('finished first parse')
    #go through and add in one more of each duplicate
    seen = set()
    for row in dupes:
        if row in seen:
            continue
        else:
            dupes.append(row)
            seen.add(row)

    print ('writing files')
    create_file(dupes, 'duplicate_leads.csv')
    create_file(dedup, 'deduplicated_leads.csv')

if __name__ == '__main__':
    main()

Best answer

You should look into the pandas module. It will be very fast, and much easier than rolling your own.

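import pandas as pd

x = pd.read_csv('Contact.csv')

# keep=False flags every member of a duplicated group, first occurrence included.
# Use the names of the columns you want to check in place of these placeholders.
duplicates = x.duplicated(['row4', 'row5', 'row19'], keep=False)

# index=False leaves the DataFrame index out of the output files
x[duplicates].to_csv('duplicates.csv', index=False)  # write duplicates
x[~duplicates].to_csv('uniques.csv', index=False)  # write uniques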
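
If pandas is not an option, the same result can be had with the standard library alone. In the question's code, the first occurrence of each key always lands in dedup, and the second loop appends each distinct row in dupes a second time, which would account for both symptoms. The following is only a sketch, not the asker's code: it counts every key in a first pass and routes rows in a second pass. The helper names split_duplicates, read_rows and write_rows are made up for illustration, and the latin1 encoding is carried over from the question as an assumption.

```python
import csv
from collections import Counter


def split_duplicates(rows, key_columns):
    """Split rows into (uniques, duplicates) by the values in key_columns.

    Every member of a duplicated group goes to duplicates, including
    the first occurrence. Input order is preserved in both lists.
    """
    def key(row):
        return tuple(row[i] for i in key_columns)

    # First pass: count how many rows share each key.
    counts = Counter(key(row) for row in rows)

    # Second pass: a row is unique only if its key occurs exactly once.
    uniques = [row for row in rows if counts[key(row)] == 1]
    duplicates = [row for row in rows if counts[key(row)] > 1]
    return uniques, duplicates


def read_rows(filename, encoding="latin1"):
    # Hypothetical helper: read all non-empty rows from a CSV file.
    with open(filename, newline="", encoding=encoding) as f:
        return [row for row in csv.reader(f) if row]


def write_rows(rows, filename, encoding="latin1"):
    # Hypothetical helper: csv.writer quotes fields that contain commas,
    # which joining the fields by hand with "," does not.
    with open(filename, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)
```

With the question's column positions, usage would be roughly: rows = read_rows('Contact.csv'), then uniques, duplicates = split_duplicates(rows, (4, 5, 19)), then one write_rows call per output file. Note that a header row, if the file has one, is treated like any other row here.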

This question and its answer are based on a similar question on Stack Overflow: https://stackoverflow.com/questions/43415191/
