python - List comprehension mystery - Python

Tags: python list-comprehension deduplication

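I have created two lists from CSV files. One is the original CSV file, the other is a deduped version of that file. I have read the contents of each into a list, and for all intents and purposes they are in the same format. Each list item is a string.

I am trying to use a list comprehension to find out which items were removed by the dedupe. The original list has 16939 items and the deduped list has 15368, a difference of 1571, but the result of my list comprehension has only 368 items. Any ideas?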

deduped = open('account_de_ex.csv', 'r')
deduped_data = deduped.read()
deduped.close()
deduped = deduped_data.split("\r")

#read in file with just the account names from the full account list
account_names = open('account_names.csv', 'r')
account_data = account_names.read()
account_names.close()
account_names = account_data.split("\r")

# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]

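Edit: In response to some of the comments, here is a test of my list comparison and of the lists themselves. The difference is about the same, 20 or so. Not the 1500 I need! Thanks!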

print(len(deduped))
deduped = set(deduped)
print(len(deduped))

print(len(account_names))
account_names = set(account_names)
print(len(account_names))


15368
15368
16939
15387

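Best answer

Try running this code and see what it reports. It requires Python 2.7 or later for collections.Counter, but you could easily write your own counter code, or copy my example code from another answer: Python : List of dict, if exists increment a dict value, if not append a new dict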
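For readers on an interpreter without collections.Counter, a hand-rolled counter of the kind mentioned above could be as small as the following sketch. The name count_rows and the sample data are hypothetical and are not taken from the linked answer.

```python
def count_rows(rows):
    """Return a dict mapping each row to the number of times it appears."""
    counts = {}
    for row in rows:
        # dict.get supplies 0 the first time a row is seen
        counts[row] = counts.get(row, 0) + 1
    return counts


# Hypothetical sample data standing in for the rows of a CSV file
sample = ["a", "b", "a", "c", "a"]
counts = count_rows(sample)
print(counts["a"])
```

A plain dict like this should be enough for the script below, which only calls items() and values() on the counter and sorts its keys. The script itself uses collections.Counter.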

from collections import Counter

# read in original records
with open("account_names.csv", "rt") as f:
    rows = sorted(line.strip() for line in f)

# count how many times each row appears
counts = Counter(rows)

# get a list of tuples of (count, row) that only includes count > 1
dups = [(count, row) for row, count in counts.items() if count > 1]
# total number of extra copies, i.e. how many rows a dedupe should remove
dup_count = sum(count-1 for count in counts.values() if count > 1)

# sort the list from largest number of dups to least
dups.sort(reverse=True)

# print a report showing how many dups
for count, row in dups:
    print("{}\t{}".format(count, row))

# get de-duped list
unique_rows = sorted(counts)

# read in de-duped list
with open("account_de_ex.csv", "rt") as f:
    de_duped = sorted(line.strip() for line in f)

print("List lengths: rows {}, uniques {}/de_duped {}, result {}".format(
        len(rows), len(unique_rows), len(de_duped), len(de_duped) + dup_count))

# lists should match since we sorted both lists
if unique_rows == de_duped:
    print("perfect match!")
else:
    # if lists don't match, find out what is going on
    uniques_set = set(unique_rows)
    deduped_set = set(de_duped)

    # find intersection of the two sets
    x = uniques_set.intersection(deduped_set)

    # print differences
    if x != uniques_set:
        print("Rows in original that are not in deduped:\n{}".format(sorted(uniques_set - x)))
    if x != deduped_set:
        print("Rows in deduped that are not in original:\n{}".format(sorted(deduped_set - x)))
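It may also help to see why the comprehension in the question cannot report the removed duplicates. The test "ele not in deduped" only keeps names that are missing from the deduped list altogether. A name that appeared three times in the original still has one copy in the deduped list, so none of its copies are reported. The sketch below uses made-up sample data (not the real account files) and Counter subtraction, which is a swapped-in technique rather than part of the script above, to list every removed copy.

```python
from collections import Counter

# Hypothetical sample data standing in for the two CSV files
original = ["acme", "acme", "acme", "globex", "initech", "initech", "umbrella"]
deduped = ["acme", "globex", "initech"]

# The comprehension from the question: finds only names that are
# absent from the deduped list altogether
missing_entirely = [name for name in original if name not in deduped]

# Multiset difference: every copy in the original beyond what the
# deduped list still holds (entries with a count of zero are dropped)
removed = Counter(original) - Counter(deduped)
removed_rows = sorted(removed.elements())

print(missing_entirely)
print(removed_rows)
```

With this sample data the comprehension finds one name, while the multiset difference finds four rows, which matches the difference in list lengths (7 - 3).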

Regarding python - List comprehension mystery - Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19670070/
