python - 这些代码行如何耗尽我所有的 RAM？以及如何修复它？

我想修改python xml解析器不支持的字符，比如“é”或“×”，所以我编写了一个python脚本来处理它。所以它将变成“José Meseguer”到“Jose Meseguer”。它适用于采样的小 xml 文件，但在原始 2GB xml 文件上，会弹出内存不足错误。

我尝试了如下的o.write(line)，但内存似乎无法容纳那么多数据，并且我的IDE弹出 Traceback (most recent call last): File "E:/Output/dblp/preprocess.py", line 11, in <module> line = line.replace(line[index1: index2 + 1], line[index1 + 1]) MemoryError .

f = open("dblp.xml")
o = open("dblp_processed.xml", 'w')

for line in f:
    flag = line.find('&') != -1 and line.find(';') != -1
    if flag:
        index = 0
        while flag:
            index1 = line.find('&', index)
            index2 = line.find(';', index)
            line = line.replace(line[index1: index2 + 1], line[index1 + 1])
            index = index1 + 1
            flag = line.find('&', index) != -1 and line.find(';', index) != -1
        o.write(line)
    else:
        o.write(line)

f.close()
o.close()

我在学校服务器上尝试了这段代码，它占用了近 200GB 并且仍在运行。

f = open("dblp_sample.xml")
o = open("dblp_processed.xml", 'w')

o_lines = list()
for line in f:
    flag = line.find('&') != -1 and line.find(';') != -1
    if flag:
        index = 0
        while flag:
            index1 = line.find('&', index)
            index2 = line.find(';', index)
            line = line.replace(line[index1: index2 + 1], line[index1 + 1])
            index = index1 + 1
            flag = line.find('&', index) != -1 and line.find(';', index) != -1
        o_lines.append(line)
    else:
        o_lines.append(line)

o.writelines(o_lines)
f.close()
o.close()

最佳答案

首先，Python有一个built-in html module您可以使用它来替换 HTML 实体:

>>> import html
>>> html.unescape('&eacute &times;')
'é ×'

其次，一次只能操作一行，因此可以一次写入一行，而不是全部存储:

import html

with open("dblp_sample.xml") as f, open("dblp_processed.xml", 'w', encoding='utf-8') as o:
    for line in f:
        o.write(html.unescape(line))

循环也可以写成:

o.writelines(map(html.unescape, f))

关于python - 这些代码行如何耗尽我所有的 RAM？以及如何修复它？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/57245400/

python - 这些代码行如何耗尽我所有的 RAM？以及如何修复它？

上一篇：python - 如何在 python 中将 XML 转换为 HTML？

下一篇：python - 使用共享内存计算点之间的距离