python - Read JSON files and add a file-name column without using append()

I have 1500 JSON files, for example:

[
  {
    "info1": {
      "name": "John",
      "age": "50",
      "country": "USA"
    },
    "info2": {
      "id1": "129",
      "id2": "151",
      "id3": "196"
    },
    "region": [
      {
        "id": "36",
        "name": "Spook",
        "spot": "2"
      },
      {
        "id": "11",
        "name": "Ghoul",
        "spot": "6"
      },
      {
        "id": "95",
        "name": "Devil",
        "spot": "4"
      }
    ]
  },
  {
    "info1": {
      "name": "Mark",
      "age": "33",
      "country": "Brasil"
    },
    "info2": {
      "id1": "612",
      "id2": "221",
      "id3": "850"
    },
    "region": [
      {
        "id": "68",
        "name": "Ghost",
        "spot": "7"
      },
      {
        "id": "75",
        "name": "Spectrum",
        "spot": "2"
      },
      {
        "id": "53",
        "name": "Phantom",
        "spot": "2"
      }
    ]
  }
]

I have loaded the important information from the JSON files into a DataFrame and added a column containing the JSON file name. My code:

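import glob
import os

import pandas as pd

path_to_json = 'my files_directory'

json_files = glob.glob(os.path.join(path_to_json, "*.json"))

# Collect values across all files; creating these lists inside the loop
# would reset them for every file
names = []
dates = []

for file_ in json_files:

    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them

    for x in df['region']:  # x is the list of region dicts of one record

        for name in x:

            names.append(name['name'])
            dates.append(os.path.basename(file_))  # file name only, e.g. 20191111.json

df2 = pd.DataFrame({'name': names, 'date': dates})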

My DataFrame looks like this:

      name           date  
0    Spook      20191111.json  
1    Ghoul      20191111.json  
2    Devil      20191111.json  
3    Ghost      20191111.json  
4    Spectrum   20191111.json  
5    Phantom    20191111.json

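I am happy with this output, but with 1500 JSON files in the folder, loading them into the DataFrame takes a long time. This is probably caused by the append() calls. How can I change this code to load the JSON files faster?

Thanks in advance for your help.

Best answer

We can do this with NumPy's flatten and pandas' concat. See the code below for reference.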

import glob
import os

import numpy as np
import pandas as pd

path_to_json = 'my files_directory'

json_files = glob.glob(os.path.join(path_to_json, "*.json"))

main_df = pd.DataFrame()  # DataFrame that will carry the data from all JSON files

for file_ in json_files:

    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not needed, so drop them

    # Turn the list of lists of dicts into a flat array of dicts
    # (this relies on every 'region' list having the same length)
    regions = np.array(df['region'].tolist()).flatten()

    regions_df = pd.DataFrame(regions.tolist())  # tolist() converts the NumPy array back to a Python list

    regions_df["date"] = file_
    main_df = pd.concat([main_df, regions_df])
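
Calling pd.concat once per file copies the rows accumulated so far on every iteration, which can add up across 1500 files. A possible variation, given here only as a sketch, collects the per-file frames in a list and concatenates once at the end. The function name load_regions is hypothetical, the files are assumed to have the structure shown in the question, and os.path.basename is used on the assumption that only the file name (as in the question's output) is wanted in the date column.

```python
import glob
import json
import os

import pandas as pd


def load_regions(path_to_json):
    """Return one DataFrame with every 'region' entry plus its source file name."""
    frames = []
    for file_ in sorted(glob.glob(os.path.join(path_to_json, "*.json"))):
        with open(file_, encoding="utf-8") as fh:
            records = json.load(fh)  # a list of objects, as in the question
        # Flatten the per-record 'region' lists into one list of dicts.
        regions = [region for record in records for region in record["region"]]
        frame = pd.DataFrame(regions)
        frame["date"] = os.path.basename(file_)  # assumption: file name only
        frames.append(frame)
    if not frames:
        # No matching files: return an empty frame instead of letting concat raise.
        return pd.DataFrame(columns=["id", "name", "spot", "date"])
    # A single concat at the end avoids re-copying the growing result per file.
    return pd.concat(frames, ignore_index=True)
```

Usage would be main_df = load_regions('my files_directory'). Because the flattening is done with a plain list comprehension, this version does not depend on every 'region' list having the same length.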

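FYI: with PySpark, this can be done using explode in Spark. explode is also available in pandas as of version 0.25.

Hope this helps you improve performance.

This is about python - Read JSON files and add a file-name column without using append(); a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58939921/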
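
As a rough illustration of the pandas variant, the sketch below applies Series.explode to a 'region' column. The records are an abbreviated, in-memory stand-in for one parsed file, and the file name is hypothetical.

```python
import pandas as pd

# Two records shaped like the question's data (abbreviated), as if parsed from one file.
records = [
    {"info1": {"name": "John"}, "region": [{"id": "36", "name": "Spook", "spot": "2"},
                                           {"id": "11", "name": "Ghoul", "spot": "6"}]},
    {"info1": {"name": "Mark"}, "region": [{"id": "68", "name": "Ghost", "spot": "7"}]},
]
df = pd.DataFrame(records)

# Series.explode (pandas >= 0.25) yields one row per element of each list.
exploded = df["region"].explode()

# Each element is a dict, so a list of them becomes one column per key.
regions_df = pd.DataFrame(exploded.tolist())
regions_df["date"] = "20191111.json"  # hypothetical file name
```

Unlike the NumPy flatten approach, explode does not need the lists to have equal length. An empty list would explode to a missing value, which this sketch does not handle.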
