I have a DataFrame whose columns hold lists of strings and lists of floats, for example:
Names                        Prob
[Anne, Mike, Anne]           [10.0, 10.0, 80.0]
[Sophie, Andy, Vera, Kate]   [30.0, 4.5, 5.5, 60.0]
[Josh, Anne, Sophie]         [51, 24, 25]
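What I want to do is loop over Names and, if a name belongs to a predefined group, relabel it and then aggregate the corresponding numbers in Prob.
For example, if team1 = ['Anne', 'Mike', 'Sophie'], I would like to end up with:
Names                          Prob
[Team_One]                     [100.0]
[Andy, Kate, Team_One, Vera]   [4.5, 60.0, 30.0, 5.5]
[Josh, Team_One]               [51, 49]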
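This is what I wrote, but it feels a bit ridiculous: I create a temporary DataFrame inside the loop and then run a groupby on it, which seems like overkill and far too heavy.
Is there a more efficient way to do this? (I am using Python 3, in case it matters.)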
import pandas as pd

def pool(df):
    team1 = ['Anne', 'Mike', 'Sophie']
    names = df['Names']
    prob = df['Prob']
    out_names = []
    out_prob = []
    for key, name in enumerate(names):
        # relabel if in team1 otherwise keep it the same
        name = ['Team_One' if x in team1 else x for x in name]
        # make a temp dataframe and group by name
        temp = pd.DataFrame({'name': name, 'prob': prob[key]})
        temp = temp.groupby('name').sum()
        # make the output
        out_names.append(temp.index.tolist())
        out_prob.append(temp['prob'].tolist())
    df['Names'] = out_names
    df['Prob'] = out_prob
    return df

df = pd.DataFrame({
    'Names': [['Anne', 'Mike', 'Anne'],
              ['Sophie', 'Andy', 'Vera', 'Kate'],
              ['Josh', 'Anne', 'Sophie']
             ],
    'Prob': [[10., 10., 80.],
             [30., 4.5, 5.5, 60.],
             [51, 24, 25]
            ]
})
out = pool(df)
print(out)
Thanks!
Best answer
Use a defaultdict to sum the values within each row's lists, then convert the result to a list of tuples and pass it to the DataFrame constructor:
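from collections import defaultdict

team1 = ['Anne', 'Mike', 'Sophie']

out = []
# walk the two list columns row by row
for a, b in zip(df['Names'], df['Prob']):
    d = defaultdict(int)
    for x, y in zip(a, b):
        if x in team1:
            d['Team_One'] += y
        else:
            # += rather than = so repeated names outside the team are summed too
            d[x] += y
    out.append((list(d.keys()), list(d.values())))

df = pd.DataFrame(out, columns=['Names','Prob'])
print (df)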
                          Names                    Prob
0                    [Team_One]                 [100.0]
1  [Team_One, Andy, Vera, Kate]  [30.0, 4.5, 5.5, 60.0]
2              [Josh, Team_One]                [51, 49]
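If the same pooling is needed in several places, the inner loop can be pulled out into a small pure-Python helper. This is only a sketch: the function name pool_row and the label parameter are hypothetical and not part of the original answer, and it assumes dict insertion order (Python 3.7+) so that names and sums stay aligned.

```python
from collections import defaultdict

def pool_row(names, probs, team, label='Team_One'):
    """Hypothetical helper: sum probs per name, folding team members into label."""
    members = set(team)          # set membership is O(1) per lookup
    totals = defaultdict(float)  # missing keys start at 0.0
    for name, prob in zip(names, probs):
        key = label if name in members else name
        totals[key] += prob
    # dicts keep insertion order, so keys and values line up
    return list(totals), list(totals.values())

names, probs = pool_row(['Josh', 'Anne', 'Sophie'], [51, 24, 25],
                        ['Anne', 'Mike', 'Sophie'])
print(names, probs)
```

It could then be applied row by row, for example with [pool_row(a, b, team1) for a, b in zip(df['Names'], df['Prob'])], and the resulting tuples passed to the DataFrame constructor as above.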
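Another solution, which works as long as Prob contains no 0 values (a non-zero total is what signals that a team member was seen in the row):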
out = []
for a, b in zip(df['Names'], df['Prob']):
    n, p = [], []
    tot = 0
    for x, y in zip(a, b):
        if x in team1:
            tot += y
        else:
            n.append(x)
            p.append(y)
    # a non-zero total means at least one team member was present
    if tot != 0:
        p.append(tot)
        n.append('Team_One')
    out.append((n, p))

df = pd.DataFrame(out, columns=['Names','Prob'])
print (df)
                          Names                    Prob
0                    [Team_One]                 [100.0]
1  [Andy, Vera, Kate, Team_One]  [4.5, 5.5, 60.0, 30.0]
2              [Josh, Team_One]                [51, 49]
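If Prob may legitimately contain 0 values, one possible way around the restriction is to track team membership with a separate boolean instead of testing the total. The sketch below mirrors the loop above; the function name pool_row_keep_zeros and the seen flag are hypothetical names introduced only for illustration.

```python
def pool_row_keep_zeros(names, probs, team, label='Team_One'):
    """Hypothetical variant: pool team members even when their total is 0."""
    out_names, out_probs = [], []
    total = 0
    seen = False  # True once any team member appears in this row
    for name, prob in zip(names, probs):
        if name in team:
            total += prob
            seen = True
        else:
            out_names.append(name)
            out_probs.append(prob)
    # append the pooled entry based on membership, not on the value of total
    if seen:
        out_names.append(label)
        out_probs.append(total)
    return out_names, out_probs

print(pool_row_keep_zeros(['Anne', 'Andy'], [0.0, 5.0], ['Anne']))
```

Like the loop it is based on, this keeps names outside the team as separate entries and does not merge duplicates among them.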
In pandas, working with lists stored inside columns is slow, so it is better to flatten the lists first:
from itertools import chain

import numpy as np

# one output record per list element; 'row' remembers the original row
lens = [len(x) for x in df['Names']]
df = pd.DataFrame({
    'row' : np.arange(len(df)).repeat(lens),
    'Names' : list(chain.from_iterable(df['Names'].tolist())),
    'Prob' : list(chain.from_iterable(df['Prob'].tolist()))
})
Then replace the values with isin and finally aggregate with sum:
team1 = ['Anne', 'Mike', 'Sophie']
df.loc[df['Names'].isin(team1), 'Names'] = 'Team_One'
df = df.groupby(['row','Names'], as_index=False, sort=False)['Prob'].sum()
print (df)
   row     Names   Prob
0    0  Team_One  100.0
1    1  Team_One   30.0
2    1      Andy    4.5
3    1      Vera    5.5
4    1      Kate   60.0
5    2      Josh   51.0
6    2  Team_One   49.0
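The flattened result is in long format. If the list-per-row layout from the question is still needed, the rows could be collected back into lists with a second groupby. This is a sketch rather than part of the original answer: the variable names flat and wide are assumptions, and it rebuilds the sample data so that it runs on its own.

```python
import pandas as pd

team1 = ['Anne', 'Mike', 'Sophie']
df = pd.DataFrame({
    'Names': [['Anne', 'Mike', 'Anne'],
              ['Sophie', 'Andy', 'Vera', 'Kate'],
              ['Josh', 'Anne', 'Sophie']],
    'Prob': [[10., 10., 80.],
             [30., 4.5, 5.5, 60.],
             [51, 24, 25]],
})

# Flatten: one (row, name, prob) record per list element.
flat = pd.DataFrame({
    'row': [i for i, names in enumerate(df['Names']) for _ in names],
    'Names': [n for names in df['Names'] for n in names],
    'Prob': [p for probs in df['Prob'] for p in probs],
})

# Relabel team members and sum within each original row.
flat.loc[flat['Names'].isin(team1), 'Names'] = 'Team_One'
flat = flat.groupby(['row', 'Names'], as_index=False, sort=False)['Prob'].sum()

# Collect the aggregated records back into one list per original row.
grouped = flat.groupby('row', sort=False)
wide = pd.DataFrame({
    'Names': grouped['Names'].apply(list),
    'Prob': grouped['Prob'].apply(list),
})
print(wide)
```

Note that Prob becomes a float column after flattening, so the integers in the last row come back as 51.0 and 49.0.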
Source: a similar question on Stack Overflow about aggregating lists using keys from another list in Python: https://stackoverflow.com/questions/54710720/