python - Deserializing a huge JSON string to Python objects

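I am using simplejson to deserialize a JSON string to Python objects. I have a custom-written object_hook that takes care of deserializing the JSON back to my domain objects.

The problem is that when my JSON string is huge (i.e. the server returns around 800K domain objects as a JSON string), my Python deserializer takes almost 10 minutes to deserialize them.

I drilled down a bit further, and it looks like simplejson itself does not do much work; it delegates everything to the object_hook. I tried optimizing my object_hook, but that did not improve performance either (I gained barely 1 minute).

My question is: is there any other standard framework that is optimized to handle huge datasets, or is there a way I can make use of the framework's capabilities rather than doing everything at the object_hook level?

I see that without an object_hook, the framework just returns a list of dictionaries, not a list of domain objects.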
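A minimal sketch of that difference, assuming the standard-library json module as a stand-in (its loads function accepts object_hook in the same way as simplejson.loads). The Counter class and the hook function are hypothetical and only illustrate the mechanism; they are not the real domain model.

```python
import json

# Hypothetical stand-in for a domain class; not the real server model.
class Counter:
    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)

def hook(dct):
    # Called once for every decoded JSON object, innermost objects first.
    if dct.get("@CLASS") == "sample.counter":
        fields = {k: v for k, v in dct.items() if not k.startswith("@")}
        return Counter(**fields)
    return dct

payload = '[{"@CLASS": "sample.counter", "val1": "ABC", "val2": 1131}]'

plain = json.loads(payload)                     # list of plain dicts
hooked = json.loads(payload, object_hook=hook)  # list of Counter instances
```

Every JSON object in the payload costs one Python-level call to the hook, which is why the hook dominates the run time at this scale.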

Any pointers here would be useful.

FYI, I am using simplejson version 3.7.2.

Here is my sample _object_hook:

def _object_hook(dct):
    if '@CLASS' in dct:  # the server tags domain objects with @CLASS
        clsname = dct['@CLASS']
        # This is like Class.forName (it imports the module and returns the class)
        cls = get_class(clsname)
        # As my server is in Java, I convert the attribute names to Python naming conventions.
        dct = dict((convert_java_name_to_python(k), dct[k]) for k in dct.keys())
        if cls != None:
            obj_key = None
            if "@uuid" in dct:
                obj_key = dct["@uuid"]
                del(dct["@uuid"])
            else:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)

            # This I found to be the most time-consuming step. In my domain object's
            # __init__ method I have the logic to set all attributes based on the kwargs passed.
            obj = cls(**dct)
            if obj_key is not None:
                # I keep all uuids along with their objects in the shared_objs dictionary.
                # shared_objs is used later to replace references.
                shared_objs[obj_key] = obj
        else:
            warning("class not found: " + clsname)
            obj = dct

        return obj
    else:
        return dct

Sample response:

    {"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-9b1778a99c82","val1":"ABC","val2":1131,"val3":1754095,"value4":{"@CLASS":"sample.nestedClass","@UUID":"f7bb298c-fd0b-4d87-bed8-74d5eb1d6517","id":1754095,"name":"XYZ","abbreviation":"ABC"}}

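I have many levels of nesting, and the number of records I receive from the server is more than 800K.

Best answer

I don't know of any framework that offers what you are looking for out of the box, but you can apply a few optimizations to the way your class instances are set up.

Since unpacking the dictionary into keyword arguments and assigning them to your instance attributes takes up most of the time, you may consider passing dct directly to your class's __init__ and using it as the instance dictionary, self.__dict__:

Trial 1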

In [1]: data = {"name": "yolanda", "age": 4}

In [2]: class Person:
   ...:     def __init__(self, name, age):
   ...:         self.name = name
   ...:         self.age = age
   ...:
In [3]: %%timeit
   ...: Person(**data)
   ...:
1000000 loops, best of 3: 926 ns per loop

Trial 2

In [4]: data = {"name": "yolanda", "age": 4}

In [5]: class Person2:
   ....:     def __init__(self, data):
   ....:         self.__dict__ = data
   ....:
In [6]: %%timeit
   ....: Person2(data)
   ....:
1000000 loops, best of 3: 541 ns per loop
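For anyone who wants to repeat the two trials outside IPython, here is a rough sketch using the standard timeit module. The loop count n is an arbitrary choice, the lambda wrappers add the same small overhead to both measurements, and the absolute numbers will differ by machine and Python version.

```python
import timeit

data = {"name": "yolanda", "age": 4}

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class Person2:
    def __init__(self, data):
        # Adopt the dict as the instance's attribute store.
        self.__dict__ = data

n = 200_000  # arbitrary number of repetitions
t_kwargs = timeit.timeit(lambda: Person(**data), number=n)
t_dict = timeit.timeit(lambda: Person2(data), number=n)

print(f"Person(**data): {t_kwargs / n * 1e9:.0f} ns per call")
print(f"Person2(data):  {t_dict / n * 1e9:.0f} ns per call")
```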

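You need not worry about self.__dict__ being modified through another reference, because the hook's reference to dct is dropped once _object_hook returns.

This will of course mean changing how your __init__ is set up, so that the attributes of your class depend strictly on the items in dct. That part is up to you.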
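Put together, the idea might look like the sketch below. It makes several assumptions: the standard-library json module stands in for simplejson, the CLASSES table is a hypothetical replacement for get_class(), the DomainObject, Counter and NestedClass classes are invented for illustration, the uuid values are placeholders, and the Java-to-Python name conversion is left out.

```python
import json

class DomainObject:
    """Hypothetical base class that adopts the decoded dict as instance state."""
    def __init__(self, data):
        # No per-attribute assignment; the dict becomes the attribute store.
        self.__dict__ = data

class Counter(DomainObject):
    pass

class NestedClass(DomainObject):
    pass

# Hypothetical lookup table standing in for get_class().
CLASSES = {"sample.counter": Counter, "sample.nestedClass": NestedClass}
shared_objs = {}

def object_hook(dct):
    cls = CLASSES.get(dct.get("@CLASS"))
    if cls is None:
        return dct  # not a known domain object; leave the dict untouched
    del dct["@CLASS"]
    obj_key = dct.pop("@UUID", None)
    obj = cls(dct)
    if obj_key is not None:
        shared_objs[obj_key] = obj
    return obj

payload = (
    '{"@CLASS":"sample.counter","@UUID":"uuid-1","val1":"ABC","val2":1131,'
    '"value4":{"@CLASS":"sample.nestedClass","@UUID":"uuid-2","id":1754095,"name":"XYZ"}}'
)
counter = json.loads(payload, object_hook=object_hook)
```

Because the decoder calls the hook on inner objects first, counter.value4 is already a NestedClass instance by the time the outer Counter is built.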

You may also replace cls != None with cls is not None (there is only one None object, so an identity check is more Pythonic):

Trial 1

In [38]: cls = 5
In [39]: %%timeit
   ....: cls != None
   ....:
10000000 loops, best of 3: 85.8 ns per loop

Trial 2

In [40]: %%timeit
   ....: cls is not None
   ....:
10000000 loops, best of 3: 57.8 ns per loop
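Speed aside, the two checks are not interchangeable for every type: != goes through the object's comparison methods, while is not compares identities only. A small sketch with a hypothetical class that claims to be equal to everything:

```python
class AlwaysEqual:
    """Hypothetical class whose comparison methods claim equality with everything."""
    def __eq__(self, other):
        return True

    def __ne__(self, other):
        return False

obj = AlwaysEqual()

ne_result = obj != None      # dispatches to AlwaysEqual.__ne__, giving False
is_result = obj is not None  # compares identities only, giving True
```

For a class object such as the one returned by get_class, both checks should agree in practice, so the identity check is the safer default.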

You could replace these two lines with one:

obj_key = dct["@uuid"]
del(dct["@uuid"])

which becomes:

obj_key = dct.pop('@uuid')  # Not an optimization; this is equivalent to the two lines above
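Note that pop with a single argument raises KeyError when the key is absent. If the membership test from the question should be folded in as well, pop accepts a default value. A minimal sketch, where take_uuid is a hypothetical helper name:

```python
def take_uuid(dct):
    """Remove and return the '@uuid' entry, or None when it is absent."""
    # pop with a default never raises KeyError, so no prior 'in' check is needed.
    return dct.pop("@uuid", None)

with_id = {"@uuid": "abc-123", "val1": "ABC"}
without_id = {"val1": "ABC"}

found = take_uuid(with_id)       # the key is removed from with_id
missing = take_uuid(without_id)  # without_id is left unchanged
```

The caller can then log the "Class missing uuid" message whenever the result is None.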

At the scale of 800K domain objects, these changes should save you a fair amount of time by letting the object_hook create objects more quickly.

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/37861698/
