python - Pandas append multiple columns for a single one

Tags: python pandas

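How can I efficiently append multiple KPI values per customer using pandas?

Joining the pivoted df with the customers df causes some problems, because country is the index of the pivoted data frame while nationality is not in the index.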

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
                           'indicator':['z','x','z','x'],
                           'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
                           'nationality':['Germany','Austria'],
                           'value':[7,8]})

See the desired result in pink (image in the original post).

Best answer

I think you can use concat:

df_pivoted = countryKPI.pivot_table(index='country', 
                              columns='indicator', 
                              values='value', 
                              fill_value=0)
print (df_pivoted)    
indicator  x  z
country        
Austria    7  7
Germany    8  9

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1))
        customer  value  x  z
Austria   second      8  7  7
Germany    first      7  8  9                       


print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)
         .reset_index()
         .rename(columns={'index':'nationality'})
         [['customer','nationality','value','x','z']])

  customer nationality  value  x  z
0   second     Austria      8  7  7
1    first     Germany      7  8  9
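
As an alternative to concat, a merge that matches the nationality column against the pivoted index might address the join problem described in the question. This is only a sketch under the same sample data; the variable name merged is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Germany', 'Austria'],
                          'value': [7, 8]})

df_pivoted = countryKPI.pivot_table(index='country',
                                    columns='indicator',
                                    values='value',
                                    fill_value=0)

# Match the 'nationality' column against the index of df_pivoted ('country').
# how='left' keeps every customer, even one whose country has no KPI rows.
merged = customers.merge(df_pivoted,
                         left_on='nationality',
                         right_index=True,
                         how='left')
print(merged)
```

With this approach the customers keep their original row order and no set_index / reset_index round trip is needed.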

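Edit based on comments:

The problem is that the dtype of both customers.nationality and countryKPI.country is category, and if some categories are missing from one of them, an error is raised: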

ValueError: incompatible categories in categorical concat

The solution is to build the combined set of categories with union and then apply it to both columns with set_categories:

import pandas as pd
import numpy as np

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
                           'indicator':['z','x','z','x'],
                           'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
                           'nationality':['Slovakia','Austria'],
                           'value':[7,8]})

customers.nationality = customers.nationality.astype('category')
countryKPI.country = countryKPI.country.astype('category')

print (countryKPI.country.cat.categories)
Index(['Austria', 'Germany'], dtype='object')

print (customers.nationality.cat.categories)
Index(['Austria', 'Slovakia'], dtype='object')

all_categories =countryKPI.country.cat.categories.union(customers.nationality.cat.categories)
print (all_categories)
Index(['Austria', 'Germany', 'Slovakia'], dtype='object')

customers.nationality = customers.nationality.cat.set_categories(all_categories)
countryKPI.country = countryKPI.country.cat.set_categories(all_categories)
df_pivoted = countryKPI.pivot_table(index='country', 
                              columns='indicator', 
                              values='value', 
                              fill_value=0)
print (df_pivoted)    
indicator  x  z
country        
Austria    7  7
Germany    8  9
Slovakia   0  0        

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)
         .reset_index()
         .rename(columns={'index':'nationality'})
         [['customer','nationality','value','x','z']])

  customer nationality  value  x  z
0   second     Austria    8.0  7  7
1      NaN     Germany    NaN  8  9
2    first    Slovakia    7.0  0  0
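
The category alignment step above can be wrapped in a small helper if it is needed in more than one place. This is a minimal sketch; the function name align_categories is hypothetical and not part of pandas, and it assumes both inputs are already of category dtype.

```python
import pandas as pd

def align_categories(a, b):
    """Return copies of two categorical Series that share one category set.

    Hypothetical helper: assumes both a and b have dtype 'category'.
    """
    all_categories = a.cat.categories.union(b.cat.categories)
    return (a.cat.set_categories(all_categories),
            b.cat.set_categories(all_categories))

nationality = pd.Series(['Slovakia', 'Austria'], dtype='category')
country = pd.Series(['Austria', 'Germany', 'Germany', 'Austria'], dtype='category')

nationality, country = align_categories(nationality, country)

# Both Series now carry the same categories; the values are unchanged.
print(list(nationality.cat.categories))
print(list(country.cat.categories))
```

After this call, both columns share the same categories, which is the condition the concat in the answer relies on.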

If you need better performance, use groupby instead of pivot_table:

# Parentheses let the method chain span several lines
df_pivoted1 = (countryKPI.groupby(['country','indicator'])
                         .mean()
                         .squeeze()
                         .unstack()
                         .fillna(0))
print (df_pivoted1)
indicator    x    z
country            
Austria    7.0  7.0
Germany    8.0  9.0
Slovakia   0.0  0.0

Timings:

In [177]: %timeit countryKPI.pivot_table(index='country', columns='indicator', values='value', fill_value=0)
100 loops, best of 3: 6.24 ms per loop

In [178]: %timeit countryKPI.groupby(['country','indicator']).mean().squeeze().unstack().fillna(0)
100 loops, best of 3: 4.28 ms per loop
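
Outside IPython, a comparison like the one above could be reproduced with the standard timeit module. The sketch below makes several assumptions: it uses plain string columns instead of categoricals, selects the value column directly instead of calling squeeze, and the function names with_pivot_table and with_groupby are made up for illustration. Absolute numbers will depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})

def with_pivot_table():
    return countryKPI.pivot_table(index='country', columns='indicator',
                                  values='value', fill_value=0)

def with_groupby():
    # Selecting ['value'] yields a Series, so squeeze() is not needed here.
    return (countryKPI.groupby(['country', 'indicator'])['value']
                      .mean()
                      .unstack()
                      .fillna(0))

# Check that both approaches give the same numbers (dtypes may differ).
same = (with_pivot_table().astype(float).values
        == with_groupby().astype(float).values).all()
print(same)

# Time 100 calls of each approach and report the average per call.
for func in (with_pivot_table, with_groupby):
    seconds = timeit.timeit(func, number=100)
    print(f"{func.__name__}: {seconds / 100 * 1000:.2f} ms per call")
```

On a frame this small the difference is mostly fixed overhead, so it is worth repeating the measurement on data of realistic size before choosing one approach.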

Regarding python - Pandas append multiple columns for a single one, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39634312/
