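How can I use pandas to add a new column to a dataframe that holds the sum of values from a nested dataframe, without losing any of the other columns or the nested data?
Specifically, I want to create a new column, total_cost, that contains the sum over all nested dataframes in a row.
Using a series of groupby and apply calls, I managed to create the following dataframe: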
user_id description unit_summary
0 111 xxx [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx [{'total_period_cost': 100, 'unit_id': 'yyy', ...
I am trying to add a total_cost column, which is the sum of total_period_cost over each nested dataframe (grouped by user_id). How can I produce the following dataframe?
user_id description total_cost unit_summary
0 111 xxx 300 [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx 100 [{'total_period_cost': 100, 'unit_id': 'yyy', ...
My code:
import pandas as pd
series = [{
    "user_id": "111",
    "description": "xxx",
    "unit_summary": [
        {
            "total_period_cost": 100,
            "unit_id": "xxx",
            "cost_per_unit": 50,
            "total_period_usage": 2
        },
        {
            "total_period_cost": 200,
            "unit_id": "yyy",
            "cost_per_unit": 25,
            "total_period_usage": 8
        }
    ]
},
{
    "user_id": "222",
    "description": "xxx",
    "unit_summary": [
        {
            "total_period_cost": 100,
            "unit_id": "yyy",
            "cost_per_unit": 25,
            "total_period_usage": 4
        }
    ]
}]
df = pd.DataFrame(series)
print(df)
print(df.to_dict(orient='records'))
Here is a sample of the groupby..apply code I used to produce the series JSON object:
import pandas as pd
series = [
{"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
{"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
{"user_id":"111", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 8},
{"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 3},
{"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 1}
]
df = pd.DataFrame(series)
sumc = (
    df.groupby(['user_id', 'unit_id', 'cost_per_unit'], as_index=False)
    .agg({'total_period_usage': 'sum'})
)
sumc['total_period_cost'] = sumc.total_period_usage * sumc.cost_per_unit
sumc = (
    sumc.groupby(['user_id'])
    # 'records' is spelled out because newer pandas versions reject the 'r' shorthand
    .apply(lambda x: x[['total_period_cost', 'unit_id', 'cost_per_unit', 'total_period_usage']].to_dict('records'))
    .reset_index()
)
sumc = sumc.rename(columns={0:'unit_summary'})
sumc['description'] = 'xxx'
print(sumc)
print(sumc.to_dict(orient='records'))
I solved this by adding the following, taken from anky_91's answer:
def myf(x):
    return pd.DataFrame(x).loc[:, 'total_period_cost'].sum()
# Sum total_period_cost over all server subscriptions
sumc['total_period_cost'] = sumc['unit_summary'].apply(myf)
Best answer
You can read each row of the unit_summary column as a dataframe and sum the column you need:
Method 1: apply
def myf(x):
    return pd.DataFrame(x).loc[:, 'total_period_cost'].sum()
df['total_cost'] = df['unit_summary'].apply(myf)
print(df)
Method 2: similarly, with a list comprehension:
df['total_cost'] = [pd.DataFrame(i)['total_period_cost'].sum()
for i in df['unit_summary'].tolist()]
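Methods 1 and 2 build a temporary dataframe for every row. As a minimal sketch of the same idea without the intermediate dataframes, the costs can be summed directly from the list of dicts. The helper name sum_costs is hypothetical (it does not appear in the original answer), and the sample data is a trimmed copy of the question's frame.

```python
import pandas as pd

# Sample data shaped like the question's frame (values copied from the question,
# with the nested dicts trimmed to the keys this example needs).
df = pd.DataFrame([
    {"user_id": "111", "description": "xxx", "unit_summary": [
        {"total_period_cost": 100, "unit_id": "xxx"},
        {"total_period_cost": 200, "unit_id": "yyy"},
    ]},
    {"user_id": "222", "description": "xxx", "unit_summary": [
        {"total_period_cost": 100, "unit_id": "yyy"},
    ]},
])


def sum_costs(units):
    """Sum total_period_cost over a list of unit dicts (hypothetical helper)."""
    return sum(unit["total_period_cost"] for unit in units)


# Same result as Method 1, without constructing a dataframe per row.
df["total_cost"] = df["unit_summary"].apply(sum_costs)
print(df[["user_id", "total_cost"]])
```

One difference worth noting: sum_costs returns 0 for an empty list, whereas myf would raise a KeyError, because a dataframe built from an empty list has no total_period_cost column.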
Method 3: using explode:
m = df['unit_summary'].explode()
# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
df['total_cost'] = pd.DataFrame(m.tolist(), index=m.index).groupby(level=0)['total_period_cost'].sum()
Output:
user_id description unit_summary \
0 111 xxx [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx [{'total_period_cost': 100, 'unit_id': 'yyy', ...
total_cost
0 300
1 100
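If the nested records are still available as plain Python dicts (as in the question's series list), another option is to flatten them with pd.json_normalize and map the per-user totals back. This is only a sketch: it assumes pandas 1.0 or later, where json_normalize is available at the top level, and it assumes each user_id appears in exactly one row, as in the question's grouped frame. The names records, flat and totals are illustrative.

```python
import pandas as pd

# Records copied from the question's first code sample.
records = [
    {"user_id": "111", "description": "xxx", "unit_summary": [
        {"total_period_cost": 100, "unit_id": "xxx",
         "cost_per_unit": 50, "total_period_usage": 2},
        {"total_period_cost": 200, "unit_id": "yyy",
         "cost_per_unit": 25, "total_period_usage": 8},
    ]},
    {"user_id": "222", "description": "xxx", "unit_summary": [
        {"total_period_cost": 100, "unit_id": "yyy",
         "cost_per_unit": 25, "total_period_usage": 4},
    ]},
]
df = pd.DataFrame(records)

# One row per nested unit, with the parent's user_id carried along as metadata.
flat = pd.json_normalize(records, record_path="unit_summary", meta=["user_id"])

# Total cost per user, then aligned back to df through the user_id column.
totals = flat.groupby("user_id")["total_period_cost"].sum()
df["total_cost"] = df["user_id"].map(totals)
print(df[["user_id", "description", "total_cost"]])
```

The unit_summary column is left untouched, so the nested data is preserved alongside the new total_cost column.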
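In addition to the above, starting from the original dataframe, we can also do the following to get the desired totals, although this will not give you the series of dictionaries ('unit_summary'):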
(df.assign(total_cost=df['cost_per_unit']*df['total_period_usage'])
.groupby(['user_id'],as_index=False)['total_cost'].sum().assign(description='xxxx'))
Output:
user_id total_cost description
0 111 300 xxxx
1 222 100 xxxx
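If both the totals and the nested unit_summary lists are needed, the two can be computed separately from the raw rows and joined on user_id. The following is a sketch under that assumption, not part of the original answer; the variable names raw, units, totals, summaries and result are illustrative.

```python
import pandas as pd

# Raw rows copied from the question's groupby..apply sample.
raw = pd.DataFrame([
    {"user_id": "111", "unit_id": "xxx", "cost_per_unit": 50, "total_period_usage": 1},
    {"user_id": "111", "unit_id": "xxx", "cost_per_unit": 50, "total_period_usage": 1},
    {"user_id": "111", "unit_id": "yyy", "cost_per_unit": 25, "total_period_usage": 8},
    {"user_id": "222", "unit_id": "yyy", "cost_per_unit": 25, "total_period_usage": 3},
    {"user_id": "222", "unit_id": "yyy", "cost_per_unit": 25, "total_period_usage": 1},
])

# Per-unit usage and cost, as in the question's first groupby.
units = (
    raw.groupby(["user_id", "unit_id", "cost_per_unit"], as_index=False)
       .agg({"total_period_usage": "sum"})
)
units["total_period_cost"] = units["total_period_usage"] * units["cost_per_unit"]

# Per-user total cost, indexed by user_id.
totals = units.groupby("user_id")["total_period_cost"].sum().rename("total_cost")

# Per-user list of unit dicts, also indexed by user_id.
cols = ["total_period_cost", "unit_id", "cost_per_unit", "total_period_usage"]
summaries = pd.Series(
    {user_id: group[cols].to_dict("records")
     for user_id, group in units.groupby("user_id")},
    name="unit_summary",
)

# Join the two on the shared user_id index and restore user_id as a column.
result = pd.concat([totals, summaries], axis=1).rename_axis("user_id").reset_index()
result["description"] = "xxx"
print(result)
```

Iterating over the groups instead of calling groupby(...).apply keeps the grouping column out of the function being applied, which sidesteps the warnings some recent pandas versions emit for that pattern.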
This post (python - Pandas: create a column from the sum of a nested dataframe column) is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/59542279/