Here is a sample of my pandas DataFrame; the real one has close to 100k rows.
import pandas as pd
df = pd.DataFrame({'cluster': ['5', '5', '5', '5', '5', '5'],
                   'mdse_item_i': ['23627102',
                                   '23627102',
                                   '23627102',
                                   '23627102',
                                   '23627102',
                                   '23627102'],
                   'predPriceQty': ['35.675543',
                                    '33.236678',
                                    '35.675543',
                                    '35.675543',
                                    '35.675543',
                                    '35.675543'],
                   'schedule_i': ['56', '56', '56', '56', '56', '56'],
                   'segment_id': ['4123', '4123', '4144', '4161', '4295', '4454'],
                   'wk': ['1', '2', '1', '1', '1', '1']})
This is the nested dictionary format I want to end up with:
{(4123, 5): {56.0: {23627102.0: {1: 35.6755430505491, 2: 33.236678}}},
(4144, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
(4161, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
(4295, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
(4454, 5): {56.0: {23627102.0: {1: 35.6755430505491}}}}
The code below works for me, but on a huge DataFrame it takes hours to build the dictionary, and I am trying to avoid iterating row by row:
forecast_dict_all = {}
for _, row in df.iterrows():
    # Every column is stored as a string, so cast the values that must be numeric.
    item_agg_id = int(row["segment_id"])
    mdse_item_i = row["mdse_item_i"]
    cluster = int(row["cluster"])
    wk = int(row["wk"])
    forecast = float(row["predPriceQty"])
    schedule_id = row["schedule_i"]
    # First row seen for this (segment, cluster) key: create its nested entry.
    if (item_agg_id, cluster) not in forecast_dict_all:
        forecast_dict_all[item_agg_id, cluster] = {
            schedule_id: {mdse_item_i: {wk: forecast}}
        }
What I have tried so far:
# Attempt 1: group by the key columns and convert each group to a dict
dict(df.groupby(['segment_id','cluster'],as_index=False).apply(lambda x: x.to_dict()).to_dict())
# Attempt 2: index by the key columns, then nest the remaining columns
df.set_index(['segment_id', 'cluster'], inplace=True)
di = df.to_dict(orient='index')
forecast_dict_all = {k:{v['schedule_i']: {v['mdse_item_i']: {v['wk']: v['predPriceQty']}}}
                     for k,v in di.items()}
# Attempt 3: index by the key columns, then group by the remaining columns
df.set_index(['segment_id', 'cluster'], inplace=True)
{k:{grp['schedule_i']: {grp['mdse_item_i']: {grp['wk']: grp['predPriceQty']}}}
 for k, grp in df.groupby(['schedule_i','mdse_item_i','wk','predPriceQty'])}
I even tried using zip, but in either case I could not get the desired output.
Edit: I am using Python 2.7.13.final.0 and pandas 0.20.1.
Any help is appreciated, thanks.
Best answer
I don't know whether this will be any faster, but it gives the expected output for the sample data.
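df = pd.DataFrame(d)  # d is the dict of string columns from the question
# Cast every column from string to its numeric type.
df = df.astype(dtype={'cluster': int, 'mdse_item_i': int, 'predPriceQty': float,
                      'schedule_i': int, 'segment_id': int, 'wk': int})
df.drop_duplicates(inplace=True)
df.set_index(['segment_id', 'cluster'], inplace=True)
# Build the inner nested dict for each row, keyed by the (segment_id, cluster) index.
answer = df.apply(lambda row:
                  {row['schedule_i']: {row['mdse_item_i']: {row['wk']: row['predPriceQty']}}},
                  axis=1).to_dict()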
Result:
{(4123, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4144, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4161, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4295, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4454, 5): {56.0: {23627102.0: {1.0: 35.675543}}}}
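Because dictionary keys are unique, a flat `to_dict()` keyed on (segment_id, cluster) can hold only one innermost entry per key, so the week 2 forecast for (4123, 5) in the question's desired output cannot appear alongside week 1. Below is a minimal sketch of one way to merge the weeks, not a tested fix for the 100k-row case. It still loops in Python, but it zips over plain column lists instead of calling `iterrows()`, which avoids building a Series for every row. The names `data`, `columns`, and `nested` are illustrative, and the keys come out as ints rather than the floats shown in the question.

```python
import pandas as pd

# Sample data in the same string-typed form as the question.
data = {
    'cluster': ['5'] * 6,
    'mdse_item_i': ['23627102'] * 6,
    'predPriceQty': ['35.675543', '33.236678', '35.675543',
                     '35.675543', '35.675543', '35.675543'],
    'schedule_i': ['56'] * 6,
    'segment_id': ['4123', '4123', '4144', '4161', '4295', '4454'],
    'wk': ['1', '2', '1', '1', '1', '1'],
}
df = pd.DataFrame(data).astype({
    'cluster': int, 'mdse_item_i': int, 'predPriceQty': float,
    'schedule_i': int, 'segment_id': int, 'wk': int,
})

columns = ['segment_id', 'cluster', 'schedule_i', 'mdse_item_i', 'wk', 'predPriceQty']
nested = {}
# tolist() yields plain Python scalars, so the loop never touches a Series.
for seg, cluster, sched, item, wk, qty in zip(*(df[c].tolist() for c in columns)):
    # setdefault creates each nesting level the first time it is seen,
    # so later rows with the same outer keys add weeks instead of overwriting.
    by_item = (nested.setdefault((seg, cluster), {})
                     .setdefault(sched, {})
                     .setdefault(item, {}))
    by_item[wk] = qty
```

With the sample data, `nested[(4123, 5)][56][23627102]` holds entries for both week 1 and week 2, while the other keys hold week 1 only.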
Note: I fixed the DataFrame's types because you do that in your code, but the best time to get the types right is when the DataFrame is created.
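As a hedged illustration of setting types at creation time: the sketch below assumes the data arrives as CSV text, which the question does not state. The `csv_text` sample and the `typed` name are hypothetical. It passes an explicit `dtype` mapping to `pd.read_csv` so that no later `astype` call is needed.

```python
import io

import pandas as pd

# Hypothetical CSV input standing in for wherever the real data comes from.
csv_text = """segment_id,cluster,schedule_i,mdse_item_i,wk,predPriceQty
4123,5,56,23627102,1,35.675543
4123,5,56,23627102,2,33.236678
4144,5,56,23627102,1,35.675543
"""

# Declare the column types up front instead of converting strings afterwards.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'segment_id': int, 'cluster': int, 'schedule_i': int,
           'mdse_item_i': int, 'wk': int, 'predPriceQty': float},
)
```

Checking `typed.dtypes` afterwards is a quick way to confirm that each column has the intended type before building the dictionary.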
Regarding "python - Convert dataframe to nested dictionary without column names", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73334697/