python - Reading a huge number of JSON files in Python?

This is not about reading large JSON files; it is about reading a large number of JSON files in the most efficient way.

Question

I am working with the last.fm dataset from the Million Song Dataset. The data is provided as a set of JSON-encoded text files whose keys are: track_id, artist, title, timestamp, similars and tags.

Currently, after trying a few options, I read them into pandas in the following way, since it is the fastest, as shown here:

import os
import pandas as pd
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json


# Path to the dataset
path = "../lastfm_train/"

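# Collect the path of every .json file in the dataset
all_files = [os.path.join(root, name)
             for root, dirs, files in os.walk(path)
             for name in files if name.endswith('.json')]


def load_json(file_path):
    # Use a context manager so each file handle is closed right after parsing
    with open(file_path) as fh:
        return json.load(fh)


data_list = [load_json(file_path) for file_path in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)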

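The current approach reads a subset quickly (1% of the full dataset in under a second). However, reading the full training set is too slow and takes forever (I have waited a couple of hours as well), and it has become a bottleneck for further tasks, such as the one shown in the question here.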

I am also using ujson to speed up parsing of the JSON files, which is evidently faster, as can be seen from this question here.

Update 1: Using a generator expression instead of a list comprehension.

data_list=(json.load(open(file)) for file in all_files)

Best answer

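If you need to read and write the dataset multiple times, you could try converting the .json files into a faster format. For example, in pandas 0.20+ you could try the .feather format.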
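
A minimal sketch of such a one-time conversion might look like the following. It rests on assumptions that are not stated in the thread: that each file's `similars` value is a list of `[other_track_id, score]` pairs, and that flattening those pairs into one row each is acceptable (nested mixed-type lists generally do not map cleanly onto a columnar format). The function names and the output file name are hypothetical, and Feather support in pandas depends on an extra package such as pyarrow being installed.

```python
import json
import os


def iter_json_paths(root_dir):
    """Yield the path of every .json file below root_dir."""
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(".json"):
                yield os.path.join(root, name)


def iter_similar_rows(paths):
    """Yield one flat (track_id, similar_id, score) row per similarity pair.

    Assumption: each file holds an object with a 'track_id' string and a
    'similars' list of [other_track_id, score] pairs.
    """
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            record = json.load(fh)
        for similar_id, score in record.get("similars", []):
            yield record["track_id"], similar_id, float(score)


def convert_to_feather(root_dir, out_path):
    """Parse every JSON file once and save the result as a flat Feather table."""
    # Imported here so the JSON helpers above work without pandas installed.
    import pandas as pd

    rows = list(iter_similar_rows(iter_json_paths(root_dir)))
    df = pd.DataFrame(rows, columns=["track_id", "similar_id", "score"])
    # The default integer index is kept; track_id stays an ordinary column.
    df.to_feather(out_path)
    return df
```

The slow JSON pass then runs only once, for example `convert_to_feather("../lastfm_train/", "lastfm_similars.feather")`, and later sessions can load the table with `pd.read_feather("lastfm_similars.feather")`. Whether this is actually faster for a given workload should be measured rather than assumed.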

Regarding "python - Reading a huge number of JSON files in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41638587/
