Given the following DataFrame:
Category Area Country Code Function Last Name LanID Spend1 Spend2 Spend3 Spend4 Spend5
0 Bisc EE RU02,UA02 Mk Smith df3432 1.0 NaN NaN NaN NaN
1 Bisc EE RU02 Mk Bibs fdss34 1.0 NaN NaN NaN NaN
2 Bisc EE UA02,EURASIA Mk Crow fdsdr43 1.0 NaN NaN NaN NaN
3 Bisc WE FR31 Mk Ellis fdssdf3 1.0 NaN NaN NaN NaN
4 Bisc WE BE32,NL31 Mk Mower TOZ1720 1.0 NaN NaN NaN NaN
5 Bisc WE FR31,BE32,NL31 LKU Elan SKY8851 1.0 1.0 1.0 1.0 1.0
6 Bisc SE IT31 Mk Bobret 3dfsfg 1.0 NaN NaN NaN NaN
7 Bisc SE GR31 Mk Concept MOSGX009 1.0 NaN NaN NaN NaN
8 Bisc SE RU02,IT31,GR31,PT31,ES31 LKU Solar MSS5723 1.0 1.0 1.0 1.0 1.0
9 Bisc SE IT31,GR31,PT31,ES31 Mk Brix fdgd22 NaN 1.0 NaN NaN NaN
10 Choc CE RU02,CZ31,SK31,PL31,LT31 Fin Ocoser 43233d NaN 1.0 NaN NaN NaN
11 Choc CE DE31,AT31,HU31,CH31 Fin Smuth 4rewf NaN 1.0 NaN NaN NaN
12 Choc CE BG31,RO31,EMA Fin Momocs hgghg2 NaN 1.0 NaN NaN NaN
13 Choc WE FR31,BE32,NL31 Fin Bruntly ffdd32 NaN NaN NaN NaN 1.0
14 Choc WE FR31,BE32,NL31 Mk Ofer BROGX011 NaN 1.0 1.0 NaN NaN
15 Choc WE FR31,BE32,NL31 Mk Hem NZJ3189 NaN NaN NaN 1.0 1.0
16 G&C NE UA02,SE31 Mk Cre ORY9499 1.0 NaN NaN NaN NaN
17 G&C NE NO31 Mk Qlyo XVM7639 1.0 NaN NaN NaN NaN
18 G&C NE GB31,NO31,SE31,IE31,FI31 Mk Omny LOX1512 NaN 1.0 1.0 NaN NaN
I would like to export it to a nested dictionary with the following structure:
{RU02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{fdss34: Bibs}
{Bisc: {SE: {LKU: {Spend1: {MSS5723: Solar}
{Spend2: {MSS5723: Solar}
{Spend3: {MSS5723: Solar}
{Spend4: {MSS5723: Solar}
{Spend5: {MSS5723: Solar}
{Choc: {CE: {Fin: {Spend2: {43233d: Ocoser}
.....
{UA02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{ffdsdr43: Crow}
{G&C: {NE: {Mkt: {Spend1: {ORY9499: Cre}
.....
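Essentially, in this dictionary I am trying to track, for each Country Code, the list of Last Name + LanID pairs per spend category (Spend1, Spend2, etc.), together with their attributes (Function, Category, Area).
The DataFrame is not very large (fewer than 200 rows), but it contains almost every kind of combination between Category/Area/Country Code, as well as between Last Names and their spend categories (many-to-many).
My challenge is that I cannot clearly conceptualize the steps I need to perform in order to prepare the DataFrame properly for export to a dict.
So far I think I need:
- a way to slice the content of the "Country Code" column on the "," delimiter: DONE
- to create new columns from the unique Country Codes, with a 1 in each row where that column's code is present: DONE
- to recursively set the index of the DataFrame to each of the newly added columns
- to move the rows that have data for each Country Code into a new DataFrame
- to export all the new DataFrames to dicts, then merge them
I am not sure whether the remaining steps are the best way to approach this, because I still struggle to understand how pd.DataFrame.to_dict should be configured for my case (if that is even possible).
I would greatly appreciate help with the coding, along with a brief explanation of your thought process at each stage.
Here is how far I got on my own: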
#keeping track of initial order of columns
initialOrder = list(df.columns.values)
# split the Country Code by ","
CCodeNoCommas= [item for items in df['Country Code'].values for item in items.split(",")]
# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
#with NaN for row values
df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])
# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]
# replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
# exists in the initial column "Country Code"; do this for each row
CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    #print (CCodeIsPresent_rowIndex)
    df.loc[CCodeIsPresent_rowIndex, c] = 1
# no clue what to do next ??
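Best answer
If you reshape the DataFrame into the right format, you can use the handy recursive dictionary function from @DSM's answer to this question. The goal is to get a DataFrame in which each row contains exactly one "entry": a unique combination of the columns you are interested in.
First, you need to split the Country Code strings into lists: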
df['Country Code'] = df['Country Code'].str.split(',')
Then expand those lists into multiple rows (using @RomanPekar's technique from this question):
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
.stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
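As a side note, the split-and-expand step above can also be sketched with DataFrame.explode, assuming a pandas version that provides it (0.25 or later). This is an alternative to the technique quoted above, not part of it; the toy frame and the names toy and exploded are only for illustration.

```python
import pandas as pd

# Two rows taken from the question's data, reduced to two columns
toy = pd.DataFrame({
    "Last Name": ["Smith", "Bibs"],
    "Country Code": ["RU02,UA02", "RU02"],
})

# Split each string into a list, then give every list element its own row
exploded = (
    toy.assign(**{"Country Code": toy["Country Code"].str.split(",")})
       .explode("Country Code")
       .reset_index(drop=True)
)
print(exploded)
```

Each original row is repeated once per country code, which is the same shape the apply/stack/join version is meant to produce.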
Then you can reshape the Spend* columns into rows, so that there is one row for each Spend* column whose value is not NaN.
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
.apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
.reset_index(level=1)['level_1'])) \
.reset_index(drop=True)
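The same wide-to-long reshape can be sketched with DataFrame.melt followed by an explicit dropna, which does not depend on stack dropping NaN values implicitly. This is a substitute for the groupby/stack approach above, shown on a small toy frame; the column name spend and the variable long_df are hypothetical, while level_1 is kept so that later steps can use the same column name.

```python
import numpy as np
import pandas as pd

# Two rows modelled on the question's data, with only two Spend columns
toy = pd.DataFrame({
    "Country Code": ["RU02", "RU02"],
    "LanID": ["MSS5723", "fdgd22"],
    "Last Name": ["Solar", "Brix"],
    "Spend1": [1.0, np.nan],
    "Spend2": [1.0, 1.0],
})
spend_cols = ["Spend1", "Spend2"]

long_df = (
    toy.melt(
        id_vars=[c for c in toy.columns if c not in spend_cols],
        value_vars=spend_cols,
        var_name="level_1",   # holds the Spend* column name
        value_name="spend",   # hypothetical name for the cell value
    )
    .dropna(subset=["spend"])  # keep only the spends that are set
    .drop(columns="spend")
    .reset_index(drop=True)
)
print(long_df)
```

Solar ends up with one row per Spend column, while Brix keeps only the Spend2 row.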
Now you have a DataFrame in which every level of the nested dictionary is its own column, so you can use this recursive dictionary function:
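def recur_dictify(frame):
    # Base case: a single column is left, so return its value(s)
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    # Recursive case: group on the first column and recurse on the rest
    grouped = frame.groupby(frame.columns[0])
    # .iloc replaces .ix, which was removed in pandas 1.0
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d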
Apply it only to the columns you want in the nested dictionary, listed in the order in which they should be nested:
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
This should produce the result you want.
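To see how the recursion nests the columns, here is a minimal sketch on a three-row toy frame with fewer levels than the real data. It assumes a pandas version where .ix is no longer available, so it uses .iloc; the toy frame is only for illustration.

```python
import pandas as pd

def recur_dictify(frame):
    # Base case: one column left, return its value(s)
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # Recursive case: group on the first column, recurse on the remaining ones
    grouped = frame.groupby(frame.columns[0])
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

# Already reshaped: one row per (country code, spend, person)
toy = pd.DataFrame({
    "Country Code": ["RU02", "RU02", "UA02"],
    "level_1": ["Spend1", "Spend1", "Spend1"],
    "LanID": ["df3432", "fdss34", "df3432"],
    "Last Name": ["Smith", "Bibs", "Smith"],
})

d = recur_dictify(toy)
print(d)
```

Each column becomes one level of keys, in left-to-right order, and the last column supplies the leaf values.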
All in one:
import pandas as pd

# Assumes df is the DataFrame shown in the question
df['Country Code'] = df['Country Code'].str.split(',')
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
.stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
.apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
.reset_index(level=1)['level_1'])) \
.reset_index(drop=True)
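def recur_dictify(frame):
    # Base case: a single column is left, so return its value(s)
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    # Recursive case: group on the first column and recurse on the rest
    grouped = frame.groupby(frame.columns[0])
    # .iloc replaces .ix, which was removed in pandas 1.0
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d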
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
Regarding "python - reshape pandas DataFrame to export as a nested dictionary", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46872011/