I have 1500 JSON files, for example:
[
  {
    "info1": {
      "name": "John",
      "age": "50",
      "country": "USA"
    },
    "info2": {
      "id1": "129",
      "id2": "151",
      "id3": "196"
    },
    "region": [
      {
        "id": "36",
        "name": "Spook",
        "spot": "2"
      },
      {
        "id": "11",
        "name": "Ghoul",
        "spot": "6"
      },
      {
        "id": "95",
        "name": "Devil",
        "spot": "4"
      }
    ]
  },
  {
    "info1": {
      "name": "Mark",
      "age": "33",
      "country": "Brasil"
    },
    "info2": {
      "id1": "612",
      "id2": "221",
      "id3": "850"
    },
    "region": [
      {
        "id": "68",
        "name": "Ghost",
        "spot": "7"
      },
      {
        "id": "75",
        "name": "Spectrum",
        "spot": "2"
      },
      {
        "id": "53",
        "name": "Phantom",
        "spot": "2"
      }
    ]
  }
]
I have loaded the important information from the JSON files into a dataframe and added a column containing the JSON file name. My code:
path_to_json = 'my files_directory'
json_files = glob.glob(os.path.join(path_to_json, "*.json"))
for file_ in json_files:
    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them
    df2 = pd.DataFrame(columns=['name', 'date'])
    names = []
    dates = []
    for x in df['region']:
        for name in x:
            names.append(name['name'])
            dates.append(file_)
    df2['name'] = names
    df2['date'] = dates
My dataframe looks like this:
name date
0 Spook 20191111.json
1 Ghoul 20191111.json
2 Devil 20191111.json
3 Ghost 20191111.json
4 Spectrum 20191111.json
5 Phantom 20191111.json
This output is fine for me, but when there are 1500 JSON files in the folder, loading them into the dataframe takes a long time. This is probably caused by using the append() function. How can I modify this code to speed up loading these JSON files?
Thanks in advance for your help.
Best answer
We can do this by using numpy's flatten and pandas' concat. Please see the code below for reference.
import glob
import os
import numpy as np
import pandas as pd

path_to_json = 'my files_directory'
json_files = glob.glob(os.path.join(path_to_json, "*.json"))

frames = []  # one DataFrame per JSON file; concatenate them all at once at the end
for file_ in json_files:
    df = pd.read_json(file_)
    df = df.drop(columns=['info1', 'info2'])  # these columns are not important to me, so I drop them
    # each cell of 'region' holds a list of dicts; flatten the array of arrays
    # of objects into a single array of objects (assumes the lists have equal length)
    regions = np.array(df['region'].tolist()).flatten()
    regions_df = pd.DataFrame(regions.tolist())  # tolist() converts the numpy array back to a plain list
    regions_df["date"] = file_
    frames.append(regions_df)

# Dataframe which carries the data from all json files
main_df = pd.concat(frames, ignore_index=True)
FYI: if we were using pyspark, we could do this with explode in Spark. In pandas, explode is also available as of 0.25.
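As a rough sketch of that explode route (assuming pandas >= 0.25; the toy DataFrame below stands in for one parsed JSON file, and the hard-coded file name is just for illustration):

```python
import pandas as pd

# Toy frame mimicking one parsed JSON file: each cell of 'region' is a list of dicts
df = pd.DataFrame({
    "region": [[
        {"id": "36", "name": "Spook", "spot": "2"},
        {"id": "11", "name": "Ghoul", "spot": "6"},
        {"id": "95", "name": "Devil", "spot": "4"},
    ]]
})

exploded = df.explode("region")                         # one row per region dict
regions_df = pd.DataFrame(exploded["region"].tolist())  # expand each dict into columns
regions_df["date"] = "20191111.json"                    # in the real loop this would be file_
print(regions_df[["name", "date"]])
```

In the loop over files, this replaces the numpy flatten step and also copes with region lists of different lengths.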
Hope this helps improve your performance.
Regarding "python - reading JSON files and adding a filename column without using append()", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58939921/