python - Reshape a pandas DataFrame to export as a nested dictionary

Given the following DataFrame:

   Category Area               Country Code Function Last Name     LanID  Spend1  Spend2  Spend3  Spend4  Spend5
0      Bisc   EE                  RU02,UA02       Mk     Smith    df3432     1.0     NaN     NaN     NaN     NaN
1      Bisc   EE                       RU02       Mk      Bibs    fdss34     1.0     NaN     NaN     NaN     NaN
2      Bisc   EE               UA02,EURASIA       Mk      Crow   fdsdr43     1.0     NaN     NaN     NaN     NaN
3      Bisc   WE                       FR31       Mk     Ellis   fdssdf3     1.0     NaN     NaN     NaN     NaN
4      Bisc   WE                  BE32,NL31       Mk     Mower   TOZ1720     1.0     NaN     NaN     NaN     NaN
5      Bisc   WE             FR31,BE32,NL31      LKU      Elan   SKY8851     1.0     1.0     1.0     1.0     1.0
6      Bisc   SE                       IT31       Mk    Bobret    3dfsfg     1.0     NaN     NaN     NaN     NaN
7      Bisc   SE                       GR31       Mk   Concept  MOSGX009     1.0     NaN     NaN     NaN     NaN
8      Bisc   SE   RU02,IT31,GR31,PT31,ES31      LKU     Solar   MSS5723     1.0     1.0     1.0     1.0     1.0
9      Bisc   SE        IT31,GR31,PT31,ES31       Mk      Brix    fdgd22     NaN     1.0     NaN     NaN     NaN
10     Choc   CE   RU02,CZ31,SK31,PL31,LT31      Fin    Ocoser    43233d     NaN     1.0     NaN     NaN     NaN
11     Choc   CE        DE31,AT31,HU31,CH31      Fin     Smuth     4rewf     NaN     1.0     NaN     NaN     NaN
12     Choc   CE              BG31,RO31,EMA      Fin    Momocs    hgghg2     NaN     1.0     NaN     NaN     NaN
13     Choc   WE             FR31,BE32,NL31      Fin   Bruntly    ffdd32     NaN     NaN     NaN     NaN     1.0
14     Choc   WE             FR31,BE32,NL31       Mk      Ofer  BROGX011     NaN     1.0     1.0     NaN     NaN
15     Choc   WE             FR31,BE32,NL31       Mk       Hem   NZJ3189     NaN     NaN     NaN     1.0     1.0
16      G&C   NE                  UA02,SE31       Mk       Cre   ORY9499     1.0     NaN     NaN     NaN     NaN
17      G&C   NE                       NO31       Mk      Qlyo   XVM7639     1.0     NaN     NaN     NaN     NaN
18      G&C   NE   GB31,NO31,SE31,IE31,FI31       Mk      Omny   LOX1512     NaN     1.0     1.0     NaN     NaN

I would like to export it to a nested dictionary with the following structure:

    {RU02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {fdss34:     Bibs}
            {Bisc:  {SE:    {LKU:   {Spend1:    {MSS5723:   Solar}
                                    {Spend2:    {MSS5723:   Solar}
                                    {Spend3:    {MSS5723:   Solar}
                                    {Spend4:    {MSS5723:   Solar}
                                    {Spend5:    {MSS5723:   Solar}
            {Choc:  {CE:    {Fin:   {Spend2:    {43233d:   Ocoser}
            .....

    {UA02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {ffdsdr43:   Crow}
            {G&C:   {NE:    {Mkt:   {Spend1:    {ORY9499:     Cre}
    .....

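Essentially, in this dictionary I am trying to keep track, for each Country Code, of the list of Last Names + LanIDs, per Spend category (Spend1, Spend2, etc.) and their attributes (Function, Category, Area).

The DataFrame is not very large (fewer than 200 rows), but it contains almost every kind of combination between Category/Area/Country Code and the Last Names and their Spend categories (many-to-many).

My challenge is that I cannot clearly conceptualize the steps I need to perform in order to prepare the DataFrame properly for export to a dict.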

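So far I think I need:

1. a way to slice the contents of the "Country Code" column on the "," separator: DONE
2. to create new columns based on the unique Country Codes, with a 1 in each row where that column's code is present: DONE
3. to set the index of the DataFrame recursively to each of the newly added columns
4. to move the rows for each Country Code that has data into a new DataFrame
5. to export all the new DataFrames to dicts, and then merge them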

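I am not sure whether steps 3-6 are the best way to approach this, as I am still struggling to understand how pd.DataFrame.to_dict should be configured for my case (if that is even possible)...

I would really appreciate your help with the coding, along with a brief explanation of your thought process at each stage.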


This is how far I got on my own:

#keeping track of initial order of columns
initialOrder = list(df.columns.values)

# split the Country Code by ","
CCodeNoCommas= [item for items in df['Country Code'].values for item in items.split(",")]

# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
#with NaN for row values
df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])

# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]


# replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
# exists in the initial column "Country Code"; do this for each row

CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:   
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]

    #print (CCodeIsPresent_rowIndex)
    df.loc[CCodeIsPresent_rowIndex, c] = 1

# no clue what to do next ??

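Best answer

If you reshape the DataFrame into the right format, you can use the handy recursive dictionary function from @DSM's answer to this question. The goal is to get a DataFrame in which each row contains exactly one "entry": a unique combination of the columns you are interested in.

First, you need to split the Country Code strings into lists: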

df['Country Code'] = df['Country Code'].str.split(',')
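
As a quick illustration of what the split produces, here is a minimal sketch on two rows from the table above. The `demo` frame is a hypothetical stand-in for the full `df`.

```python
import pandas as pd

# Two rows taken from the sample table; `demo` is a hypothetical name.
demo = pd.DataFrame({
    "Country Code": ["RU02,UA02", "RU02"],
    "LanID": ["df3432", "fdss34"],
})

# Each comma-separated string becomes a Python list of codes.
demo["Country Code"] = demo["Country Code"].str.split(",")

print(demo["Country Code"].tolist())
# [['RU02', 'UA02'], ['RU02']]
```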

Then expand those lists into multiple rows (using @RomanPekar's technique from this question):

s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
    .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
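
If your pandas version is 0.25 or later, `DataFrame.explode` should give the same one-row-per-code result more directly. This is a sketch of that alternative, not the technique referenced above; `demo` and `exploded` are hypothetical names.

```python
import pandas as pd

# Two rows taken from the sample table; `demo` is a hypothetical name.
demo = pd.DataFrame({
    "Country Code": ["RU02,UA02", "RU02"],
    "Last Name": ["Smith", "Bibs"],
    "LanID": ["df3432", "fdss34"],
})

# Split the strings into lists, then emit one row per list element.
demo["Country Code"] = demo["Country Code"].str.split(",")
exploded = demo.explode("Country Code").reset_index(drop=True)

print(exploded)
```

Unlike the drop-and-join version, `explode` leaves the "Country Code" column in its original position, which does not matter here because the columns are selected by name later on.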

Then you can reshape the Spend* columns into rows, so that there is one row for each Spend* column whose value is not NaN.

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
    .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
    .reset_index(level=1)['level_1'])) \
    .reset_index(drop=True)
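
The groupby/apply version relies on `stack()` dropping NaN values by default. As an alternative sketch of the same reshape, `DataFrame.melt` followed by an explicit `dropna` makes that step visible. The names `demo`, `long_df` and `spend_flag` are hypothetical, and the output column is called `level_1` only so that it matches the column name used in the rest of this answer.

```python
import numpy as np
import pandas as pd

spend_cols = ["Spend1", "Spend2"]

# Two already-exploded rows; `demo` is a hypothetical name.
demo = pd.DataFrame({
    "Country Code": ["RU02", "RU02"],
    "Last Name": ["Smith", "Solar"],
    "LanID": ["df3432", "MSS5723"],
    "Spend1": [1.0, 1.0],
    "Spend2": [np.nan, 1.0],
})

# Every non-Spend column identifies the row and is repeated per Spend column.
id_cols = [c for c in demo.columns if c not in spend_cols]

long_df = (
    demo.melt(id_vars=id_cols, value_vars=spend_cols,
              var_name="level_1", value_name="spend_flag")
        .dropna(subset=["spend_flag"])   # keep only Spend columns that are set
        .drop(columns="spend_flag")      # the flag itself is no longer needed
        .reset_index(drop=True)
)

print(long_df)
```

Note that this variant drops the original Spend* columns, whereas the groupby/apply version keeps them; either way only the `level_1` column is used afterwards.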

Now you have a DataFrame in which each level of the nested dictionary is its own column, so you can use this recursive dictionary function:

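def recur_dictify(frame):
    # Base case: one column left, so return its value (or values).
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    # Group on the first column and recurse on the remaining columns.
    grouped = frame.groupby(frame.columns[0])
    # .ix was removed in pandas 1.0; .iloc is the positional equivalent.
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d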
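
To see what the function returns, here it is applied to a small hypothetical frame (`demo`) whose columns are already in nesting order. The sketch uses positional `.iloc` slicing, which is what current pandas versions support.

```python
import pandas as pd

def recur_dictify(frame):
    # Base case: a single column is left, so return its value(s).
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # Group on the first column and recurse on the remaining columns.
    grouped = frame.groupby(frame.columns[0])
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

# Hypothetical, already-reshaped rows: one entry per row.
demo = pd.DataFrame({
    "Country Code": ["RU02", "RU02", "UA02"],
    "level_1": ["Spend1", "Spend1", "Spend1"],
    "LanID": ["df3432", "fdss34", "df3432"],
    "Last Name": ["Smith", "Bibs", "Smith"],
})

nested = recur_dictify(demo)
print(nested)
```

Each column becomes one dictionary level, and the last column supplies the leaf values. If the same path occurs in more than one row, the leaf is a NumPy array of values rather than a single value.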

Apply it only to the columns you want in the nested dictionary, listed in the order in which they should be nested:

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])

This should produce the result you want.


All in one:

import pandas as pd

df['Country Code'] = df['Country Code'].str.split(',')
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
    .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
    .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
    .reset_index(level=1)['level_1'])) \
    .reset_index(drop=True)

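def recur_dictify(frame):
    # Base case: one column left, so return its value (or values).
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    # Group on the first column and recurse on the remaining columns.
    grouped = frame.groupby(frame.columns[0])
    # .ix was removed in pandas 1.0; .iloc is the positional equivalent.
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d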

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
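
For a self-contained run, the following sketch applies the whole pipeline to three rows copied from the sample table (Spend3 to Spend5 are left out for brevity). It assumes pandas 0.25 or later and swaps in `explode` and `melt` for the two reshaping steps, so it is a variant of the approach above rather than a line-for-line copy; the `flag` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Rows 0, 1 and 10 of the sample table, with only two Spend columns.
df = pd.DataFrame({
    "Category": ["Bisc", "Bisc", "Choc"],
    "Area": ["EE", "EE", "CE"],
    "Country Code": ["RU02,UA02", "RU02", "RU02,CZ31,SK31,PL31,LT31"],
    "Function": ["Mk", "Mk", "Fin"],
    "Last Name": ["Smith", "Bibs", "Ocoser"],
    "LanID": ["df3432", "fdss34", "43233d"],
    "Spend1": [1.0, 1.0, np.nan],
    "Spend2": [np.nan, np.nan, 1.0],
})
spend_cols = ["Spend1", "Spend2"]

# Step 1: one row per country code.
df["Country Code"] = df["Country Code"].str.split(",")
df = df.explode("Country Code")

# Step 2: one row per Spend column that is actually set.
id_cols = [c for c in df.columns if c not in spend_cols]
df = (df.melt(id_vars=id_cols, value_vars=spend_cols,
              var_name="level_1", value_name="flag")
        .dropna(subset=["flag"]))

def recur_dictify(frame):
    # Base case: a single column is left, so return its value(s).
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # Group on the first column and recurse on the remaining columns.
    grouped = frame.groupby(frame.columns[0])
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

# Step 3: nest the columns in the desired order.
cols = ["Country Code", "Category", "Area", "Function",
        "level_1", "LanID", "Last Name"]
d = recur_dictify(df[cols])

print(d["RU02"]["Bisc"]["EE"]["Mk"]["Spend1"])
```

The innermost dictionary maps each LanID to a Last Name, which matches the structure requested in the question.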

Regarding python - Reshape a pandas DataFrame to export as a nested dictionary, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46872011/
