python - How to split a single column of a pandas DataFrame into multiple columns with groups?

Tags: python pandas dataframe

I am new to python pandas. I have a DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
print(df)

       Name age
0  football  25
1    ramesh  22
2    suresh  21
3    pankaj  32
4   cricket  37
5    rakesh  26
6     mohit  24
7    mahesh  30

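The Name column contains both the sport names and the names of the sports persons. I want to split it into two separate columns, like this:

Expected output: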

sports_name sport_person_name age
football    ramesh            22
            suresh            21
            pankaj            32
cricket     rakesh            26
            mohit             24
            mahesh            30

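If I group by the Name column I do not get the expected output, which is not surprising, because the Name column contains no duplicates. What do I need to use to get the expected output?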

Edit: if you do not want to hard-code the sport names

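import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['', '22', '21', '32', '', '26', '24', '30']})

# treat empty strings as missing values
df = df.replace('', np.nan)

# rows with a missing value in any column are the sport rows
nan_rows = df[df.isnull().any(axis=1)]
sports = nan_rows['Name'].tolist()

# keep Name only on the sport rows, then forward fill it downwards
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name': 'sport_person_name'}
# drop the sport rows themselves and rename the Name column
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name', 'sport_person_name', 'age']]
print(df)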

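I simply looked for the rows that contain NaN in the columns other than Name; those rows are certainly the sport names. I built a list of these sport names and used the solution below to create the sports_name and sport_person_name columns.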
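The detection step described above can also be written without the intermediate list. The following is only a sketch of one possible variant, not the code from the edit: it assumes that a row is a sport row exactly when its age is an empty string, and the names is_sport and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['football', 'ramesh', 'suresh', 'pankaj',
             'cricket', 'rakesh', 'mohit', 'mahesh'],
    'age': ['', '22', '21', '32', '', '26', '24', '30'],
})

# Assumption: a sport row is one whose age is blank.
is_sport = df['age'] == ''

# Keep Name only on the sport rows, then carry it down to the rows below.
df['sports_name'] = df['Name'].where(is_sport).ffill()

# Drop the sport rows themselves, then rename and reorder the columns.
out = (df[~is_sport]
       .rename(columns={'Name': 'sport_person_name'})
       .reset_index(drop=True)
       [['sports_name', 'sport_person_name', 'age']])
print(out)
```

Using the boolean mask directly avoids the isin lookup, but it relies on the blank-age assumption holding for every sport row.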

Best answer

You can use:

#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30
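To see why the isin / where / ffill chain works, it can help to look at each intermediate result. The sketch below uses a shortened Series (names) made up for illustration; the variable names mask, only_sports and filled are not part of the answer.

```python
import pandas as pd

names = pd.Series(['football', 'ramesh', 'suresh', 'cricket', 'rakesh'])
sports = ['football', 'cricket']

# True only where the value is one of the known sports.
mask = names.isin(sports)

# where() keeps values where the mask is True and puts NaN elsewhere.
only_sports = names.where(mask)

# ffill() copies the last seen sport down into the NaN gaps.
filled = only_sports.ffill()

# Show all steps side by side.
print(pd.DataFrame({'Name': names, 'mask': mask,
                    'where': only_sports, 'ffill': filled}))
```

After the forward fill, the sport rows are the only rows where the filled column equals Name, which is why comparing the two columns removes them.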

A similar solution with DataFrame.insert, in which case the columns do not need to be reordered:

#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30
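One detail of DataFrame.insert that is easy to trip over: it modifies the DataFrame in place and returns None, so its result should not be assigned back to df. A minimal sketch on a made-up two-row frame (the name result is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'sport_person_name': ['ramesh', 'suresh'],
                   'age': ['22', '21']})

# insert(loc, column, value) places the new column at position loc
# and changes df in place; the call itself returns None.
result = df.insert(0, 'sports_name', ['football', 'football'])

print(result)
print(list(df.columns))
```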

If each sport value is needed only once, add limit=1 to ffill and replace the NaN values with empty strings:

sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1                        suresh  21
2                        pankaj  32
3     cricket            rakesh  26
4                         mohit  24
5                        mahesh  30
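The effect of limit=1 is easiest to see on a small Series. In this sketch (s and limited are illustrative names) each sport is followed by several missing values; limit=1 fills only the first one, so after the sport rows are dropped the sport name remains on the first person of each group only.

```python
import pandas as pd

s = pd.Series(['football', None, None, None, 'cricket', None, None])

# Without a limit every gap is filled from the last valid value.
print(s.ffill().tolist())

# limit=1 fills at most one consecutive missing value after each valid one;
# the remaining gaps stay missing and fillna('') turns them into blanks.
limited = s.ffill(limit=1).fillna('')
print(limited.tolist())
```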

For python - How to split a single column of a pandas DataFrame into multiple columns with groups?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46149019/
