python - Reading a large JSON file in Python

I have a large JSON file of about 5 GB, but it is not a single JSON document. It consists of multiple JSON objects concatenated together.

{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820833807970304,"id_str":"422820833807970304"}
{"created_at":"Mon Jan 13 20:01:57 +0000     2014","id":422820837545500672,"id_str":"422820837545500672"}.....

There is no newline between the curly braces: }{.

I tried using sed to put a newline between the curly braces and then reading the file with the following code:

data=[]
for line in open(filename,'r').readline():
    data.append(json.loads(line))

But this does not work.

How can I read this file relatively quickly?

Any help is much appreciated.

Best answer

This is a hack. It does not load the whole file into memory. I really hope you are using Python 3.

DecodeLargeJSON.py

from DecodeLargeJSON import *
import io
import json

# create a file with two jsons
f = io.StringIO()
json.dump({1:[]}, f)
json.dump({2:"hallo"}, f)
print(repr(f.getvalue()))
f.seek(0) 

# decode the file f. f could be any file from here on. f.read(...) should return str
o1, idx1 = json.loads(FileString(f), cls = BigJSONDecoder)
print(o1) # this is the loaded object
# idx1 is the index that the second object begins with
o2, idx2 = json.loads(FileString(f, idx1), cls = BigJSONDecoder)
print(o2)

If you find objects that cannot be decoded, let me know and we can work out a solution.

Disclaimer: This is not a sound or optimal solution. It is a hack that shows how this can be made possible.

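Discussion: Because it does not load the whole file into memory, regular expressions do not work. It also uses the pure-Python implementation of the decoder rather than the C implementation, which may make it slower. I really hate that such a simple task is this hard. I hope someone else points out a different solution.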
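As one possible alternative, not part of the original answer, the standard library's `json.JSONDecoder.raw_decode` can parse one value from a string and report where it ended, which allows reading the file in chunks. The sketch below assumes every top-level value is an object or array (as in the question's sample) and that individual objects are small compared with the chunk size. The names `iter_concatenated_json` and `chunk_size` are hypothetical and chosen only for illustration.

```python
import io
import json


def iter_concatenated_json(fp, chunk_size=65536):
    """Yield each top-level JSON value from a text stream of concatenated values.

    Assumption: top-level values are objects or arrays, so a value cut off at a
    chunk boundary always fails to decode instead of decoding as a shorter value.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        eof = not chunk
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos >= len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if eof:
                    raise  # leftover data at end of file is really malformed
                break  # value is probably incomplete; read another chunk
            yield obj
        # Keep only the undecoded tail so memory use stays bounded.
        buffer = buffer[pos:]
        if eof:
            break


# Small in-memory demonstration with no newline between the objects.
sample = io.StringIO('{"id": 1}{"id": 2}')
for record in iter_concatenated_json(sample):
    print(record["id"])
```

For a real file, the same generator could be driven by `open(filename, encoding="utf-8")` inside a `with` block, processing each record as it is yielded instead of appending everything to a list. A value that spans many chunks is re-parsed from its start on every read, so a very large single object would make this approach slow.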

Regarding "python - Reading a large JSON file in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22900804/
