I have a simple piece of code that reads a csv file, looks for duplicates based on the first 2 columns, then writes the duplicates to another csv and keeps the unique values in a third csv...
I am using a set:
import csv

def my_func():
    area = "W09"
    inf = r'f:\JDo\Cleaned\_merged\\' + area + '.csv'
    out = r'f:\JDo\Cleaned\_merged\no_duplicates\\' + area + '_no_duplicates.csv'
    out2 = r'f:\JDo\Cleaned\_merged\duplicates\\' + area + '_duplicates.csv'

    seen = set()

    with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
        reader = csv.reader(infile, delimiter=" ")
        writer1 = csv.writer(outfile1, delimiter=" ")
        writer2 = csv.writer(outfile2, delimiter=" ")
        for row in reader:
            x, y = row[0], row[1]
            x = float(x)
            y = float(y)
            if (x, y) in seen:
                # duplicate of an earlier row: send it to the duplicates file
                writer2.writerow(row)
                continue
            seen.add((x, y))  # first occurrence: remember it and keep the row
            writer1.writerow(row)

    seen.clear()
I thought a set would be the best choice here, but the size of the set is seven times the size of the input file? (The input files range from 140 MB to 50 GB of csv.) RAM usage goes from 1 GB to almost 400 GB (I am using a server with 768 GB of RAM).
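A rough back-of-the-envelope sketch (not from the original script; exact sizes vary with the CPython build) shows why the blow-up is plausible: each row kept as a tuple of two floats costs several times more in RAM than the ~20 bytes it occupies on disk.

import sys

# cost of one (x, y) entry on 64-bit CPython, using values from the sample data
x, y = 475596.0, 101832.0
pair = (x, y)
print(sys.getsizeof(pair))             # ~56-72 bytes for the tuple itself
print(2 * sys.getsizeof(x))            # plus 24 bytes per float object
# ...plus the set's hash-table slot for the entry, versus the ~20 bytes
# the same row occupies in the csv file:
print(len("475596 101832 4926\r\n"))   # 20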
I also ran a memory profiler on a small sample:
Line #    Mem usage    Increment   Line Contents
================================================
     8   21.289 MiB   21.289 MiB   @profile
     9                             def my_func():
    10   21.293 MiB    0.004 MiB       area = "W10"
    11
    12   21.293 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    13   21.293 MiB    0.000 MiB       out = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    14   21.297 MiB    0.004 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
    15
    16
    17
    18                                 #i = 0
    19   21.297 MiB    0.000 MiB       seen = set()
    20
    21   21.297 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
    22   21.297 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
    23   21.297 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
    24   21.297 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
    25 1089.914 MiB   -9.008 MiB           for row in reader:
    26 1089.914 MiB   -7.977 MiB               x, y = row[0], row[1]
    27
    28 1089.914 MiB   -6.898 MiB               x = float(x)
    29 1089.914 MiB  167.375 MiB               y = float(y)
    30
    31 1089.914 MiB  166.086 MiB               if (x, y) in seen:
    32                                             #z = line.split(" ",3)[-1]
    33                                             #if z == "5284":
    34                                             #    print X, Y, z
    35
    36 1089.914 MiB    0.004 MiB                   writer2.writerow(row)
    37 1089.914 MiB    0.000 MiB                   continue
    38 1089.914 MiB  714.102 MiB               seen.add((x, y))
    39 1089.914 MiB   -9.301 MiB               writer1.writerow(row)
    40
    41
    42
    43  690.426 MiB -399.488 MiB       seen.clear()
What could be the problem? Is there a faster way to filter the results, or one that uses less memory?
Sample csv (we are looking at GeoTIFFs converted to csv files, so it is X, Y, value):
475596 101832 4926
475626 101832 4926
475656 101832 4926
475686 101832 4926
475716 101832 4926
475536 101802 4926
475566 101802 4926
475596 101802 4926
475626 101802 4926
475656 101802 4926
475686 101802 4926
475716 101802 4926
475746 101802 4926
475776 101802 4926
475506 101772 4926
475536 101772 4926
475566 101772 4926
475596 101772 4926
475626 101772 4926
475656 101772 4926
475686 101772 4926
475716 101772 4926
475746 101772 4926
475776 101772 4926
475806 101772 4926
475836 101772 4926
475476 101742 4926
475506 101742 4926
EDIT: I tried the solution provided by Jean: https://stackoverflow.com/a/49008391/9418396
The result is that on my small 140 MB csv, the size of the set is now halved, which is a nice improvement. I will try to run it on the bigger data and see what it does. I can't really tie it to the profiler, because the profiler greatly prolongs the execution time.
Line #    Mem usage    Increment   Line Contents
================================================
     8   21.273 MiB   21.273 MiB   @profile
     9                             def my_func():
    10   21.277 MiB    0.004 MiB       area = "W10"
    11
    12   21.277 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    13   21.277 MiB    0.000 MiB       out = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    14   21.277 MiB    0.000 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
    15
    16
    17   21.277 MiB    0.000 MiB       seen = set()
    18
    19   21.277 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
    20   21.277 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
    21   21.277 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
    22   21.277 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
    23  451.078 MiB -140.355 MiB           for row in reader:
    24  451.078 MiB -140.613 MiB               hash = float(row[0])*10**7 + float(row[1])
    25                                         #x, y = row[0], row[1]
    26
    27                                         #x = float(x)
    28                                         #y = float(y)
    29
    30                                         #if (x, y) in seen:
    31  451.078 MiB   32.242 MiB               if hash in seen:
    32  451.078 MiB    0.000 MiB                   writer2.writerow(row)
    33  451.078 MiB    0.000 MiB                   continue
    34  451.078 MiB   78.500 MiB               seen.add((hash))
    35  451.078 MiB -178.168 MiB               writer1.writerow(row)
    36
    37  195.074 MiB -256.004 MiB       seen.clear()
Best Answer
You could create your own hash function to avoid storing a tuple of floats, and instead combine the two floats into a single float value in a unique way. Assuming the coordinates cannot exceed 10 million (maybe you could lower that to 1 million), you could do:
hash = x*10**7 + y
(this performs a kind of logical "OR" on the floats, and since the values are bounded, there is no mixing between x and y)

Then put hash into the set, instead of a tuple of floats. There is no risk of float absorption at 10**14, but it is worth checking:
>>> 10**14+1.5
100000000000001.5
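For reference, absorption would only set in near 2**53, where a double can no longer represent every integer, comfortably above the 10**14 used here:

>>> 2.0**53
9007199254740992.0
>>> 2.0**53 + 1
9007199254740992.0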
The loop becomes:
for row in reader:
    hash = float(row[0])*10**7 + float(row[1])
    if hash in seen:
        writer2.writerow(row)
        continue
    seen.add(hash)
    writer1.writerow(row)
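If the coordinates are always whole numbers, as in the sample above, a variant of this loop (a sketch, not part of the original answer, assuming integral coordinates with 0 <= y < 10**7) can pack them as Python ints instead and sidestep floating-point precision entirely:

for row in reader:
    # exact integer packing; assumes integral coordinates and 0 <= y < 10**7
    key = int(row[0]) * 10**7 + int(row[1])
    if key in seen:
        writer2.writerow(row)
        continue
    seen.add(key)
    writer1.writerow(row)

An int of this magnitude takes 32 bytes on 64-bit CPython versus 24 for a float, so the memory cost is similar, while membership tests are exact.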
A single float, even a big one (a float's size is fixed), is at least 2 or 3 times smaller in memory than a tuple of 2 floats. On my machine:
>>> sys.getsizeof((0.44,0.2))
64
>>> sys.getsizeof(14252362*10**7+35454555.0)
24
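To see the effect on a whole set rather than on single values, here is a rough measurement sketch (not from the original answer; the numbers vary across CPython versions):

import sys

n = 10**6
as_tuples = {(float(i), float(i + 1)) for i in range(n)}
as_floats = {float(i) * 10**7 + float(i + 1) for i in range(n)}

def footprint(s):
    # the set's hash table plus every object it keeps alive
    total = sys.getsizeof(s)
    for item in s:
        total += sys.getsizeof(item)
        if isinstance(item, tuple):
            total += sum(sys.getsizeof(v) for v in item)
    return total

print(footprint(as_tuples) // 2**20, "MiB")  # typically 2-3x the float variant
print(footprint(as_floats) // 2**20, "MiB")

This is consistent with the halving of the set size reported in the edit above.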
The original question, python - What exactly is using up so much memory?, can be found on Stack Overflow: https://stackoverflow.com/questions/49007929/