python - Freeing memory during a loop

Tags: python json performance optimization out-of-memory

I am running into a memory error in my code. My parser can be summarized like this:

#!/usr/bin/env python
# coding=utf-8
import sys
import json
from collections import defaultdict


class MyParserIter(object):

    def _parse_line(self, line):
        for couple in line.split(","):
            key, value = couple.split(':')[0], couple.split(':')[1]
            self.__hash[key].append(value)

    def __init__(self, line):
        # not the real parsing, just an example that parses each
        # line into a dict-like object
        self.__hash = defaultdict(list)
        self._parse_line(line)

    def __iter__(self):
        return iter(self.__hash.values())

    def to_dict(self):
        return self.__hash

    def __getitem__(self, item):
        return self.__hash[item]

    def free(self, item):
        self.__hash[item] = None

    def free_all(self):
        for k in self.__hash:
            self.free(k)

    def to_json(self):
        return json.dumps(self.to_dict())


def parse_file(file_path):
    list_result = []
    with open(file_path) as fin:
        for line in fin:
            parsed_line_obj = MyParserIter(line)
            list_result.append(parsed_line_obj)
    return list_result


def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        for obj in list_obj:
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()
            obj = None

if __name__ == '__main__':
        result_list = parse_file('test.in')
        print(sys.getsizeof(result_list))
        write_to_file(result_list)
        print(sys.getsizeof(result_list))
        # the same result for memory usage result_list
        print(sys.getsizeof([None] * len(result_list)))
        # the result is not the same :(

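The goal is to parse a (large) file and convert each line into a JSON object that is written back to a file.

I want to reduce the memory footprint, because in some cases this code raises a memory error. After each fout.write I would like to delete the obj reference (free its memory).

I tried setting obj to None and calling the method obj.free_all(), but neither of them frees the memory. I also used simplejson instead of json, which reduced the footprint, but it is still too large in some cases.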

test.in looks like:

test1:OK,test3:OK,...
test1:OK,test3:OK,...
test1:OK,test3:OK,test4:test_again...
....

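Best answer

Don't store many class instances in an array; process each line inline instead, so nothing from earlier lines stays referenced. Example: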

% cat test.in
test1:OK,test3:OK
test1:OK,test3:OK
test1:OK,test3:OK,test4:test_again

% cat test.py 
import json

# Text mode, so each line is a str; only the current line is held in memory.
with open("test.in", "r") as src:
    with open("out.out", "w") as dst:
        for line in src:
            obj = {}
            # Split each "key:value" pair on the first colon only.
            for pair in line.rstrip().split(","):
                k, v = pair.split(":", 1)
                obj.setdefault(k, []).append(v)
            dst.write(json.dumps(obj) + "\n")

% cat out.out
{"test1": ["OK"], "test3": ["OK"]}
{"test1": ["OK"], "test3": ["OK"]}
{"test1": ["OK"], "test3": ["OK"], "test4": ["test_again"]}
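If the parsing logic needs to stay in its own function or class, the same streaming idea can probably be kept by using a generator, so that only one parsed line is alive at a time. The following is a minimal sketch, not the asker's real parser: the names parse_line, iter_parsed and convert are hypothetical, and it assumes every comma-separated pair contains a colon.

```python
import json
from collections import defaultdict


def parse_line(line):
    """Parse 'k1:v1,k2:v2' into a dict-like object mapping key -> list of values.

    Hypothetical stand-in for the real parser; assumes every pair has a colon.
    """
    parsed = defaultdict(list)
    for couple in line.rstrip("\n").split(","):
        key, value = couple.split(":", 1)
        parsed[key].append(value)
    return parsed


def iter_parsed(lines):
    """Yield one parsed line at a time instead of building a full list."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield parse_line(line)


def convert(src_path, dst_path):
    """Stream src_path to dst_path as one JSON object per line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for parsed in iter_parsed(src):
            # 'parsed' is rebound on the next iteration, so the previous
            # object has no remaining references and can be reclaimed.
            dst.write(json.dumps(parsed) + "\n")
```

Calling convert("test.in", "out.out") should then produce the same out.out as above, without a result list that keeps every parsed object referenced.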

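If this is slow, don't write to the file line by line; instead, collect the dumped JSON strings in an array and call dst.write("\n".join(array)). Note that the array then holds the entire output in memory, so this trades memory for speed.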
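One possible middle ground between per-line writes and a single large join is to flush in fixed-size batches, which keeps memory bounded by the batch size. This is only a sketch: write_batched and its batch_size parameter are hypothetical names, and the default of 1000 is an arbitrary assumption that would need tuning.

```python
import json


def write_batched(objects, dst, batch_size=1000):
    """Write objects as JSON lines, flushing every batch_size items.

    'objects' is any iterable of JSON-serializable values and 'dst' is any
    object with a write(str) method, such as a file opened in text mode.
    """
    batch = []
    for obj in objects:
        batch.append(json.dumps(obj))
        if len(batch) >= batch_size:
            dst.write("\n".join(batch) + "\n")
            batch = []  # drop the flushed strings so they can be reclaimed
    if batch:
        # Flush whatever is left after the last full batch.
        dst.write("\n".join(batch) + "\n")
```

Passing a generator as objects means that at most batch_size dumped strings are held at once, rather than the whole output.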

This question, python - Freeing memory during a loop, corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/35453406/
