Python Pandas : classifying values in column and making a new column

Tags: python pandas dataframe

Quick question. I'm trying to create a column in df that classifies the values in another column. Take a look at my code below.

df['maker_grp'] = np.nan
for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
    df['maker_grp'][key] = 'Class1'
for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
    df['maker_grp'][key] = 'Class2'
df['maker_grp'] = df.maker_grp.fillna('Class3')

It works fine, but I feel there must be a more Pythonic way to do this with less processing. Please help. Thanks.

Best answer

Use numpy.select:

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')

Example:

df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})
#print (df)

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
print (df)
  maker_nm maker_grp
0    Sam 1    Class1
1    Joe 5    Class3
2   Paul 7    Class2
3   Mike 0    Class1
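One detail that is easy to miss: when a row satisfies more than one condition, np.select uses the choice paired with the first condition that is True. The sketch below illustrates this with a made-up name ('Sam and Paul') that matches both patterns; the sample data is an assumption for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second name matches both patterns.
df = pd.DataFrame({'maker_nm': ['Sam 1', 'Sam and Paul', 'Paul 7', 'Joe 5']})

m1 = df['maker_nm'].str.contains("Sam|Mike")
m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")

# Where several conditions are True, np.select takes the choice that
# belongs to the first True condition in the list, so order matters.
df['maker_grp'] = np.select([m1, m2], ['Class1', 'Class2'], default='Class3')
print(df)
```

If the column may contain missing values, passing na=False to str.contains keeps the masks strictly boolean.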

If there are many conditions, applying a custom function should be faster:

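import re

# Compile once at module level rather than on every call.
p1 = re.compile("Sam|Mike")
p2 = re.compile("Andy|John|Paul|Jay")

def f(x):
    # search() looks anywhere in the string, like str.contains;
    # match() would only test the beginning of the string.
    if p1.search(x):
        return 'Class1'
    elif p2.search(x):
        return 'Class2'
    else:
        return 'Class3'

df['maker_grp'] = df['maker_nm'].apply(f)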
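When writing a function like this with the re module, it is worth keeping in mind that re.match only tests the beginning of the string, whereas re.search scans the whole string, which is what str.contains does with a regex. A minimal illustration, using a made-up input string:

```python
import re

pattern = re.compile("Sam|Mike")

# match() succeeds only if the pattern occurs at the start of the string ...
starts_with = pattern.match("Big Sam 1") is not None
# ... while search() finds the pattern at any position.
anywhere = pattern.search("Big Sam 1") is not None

print(starts_with, anywhere)
```

For the sample data in this answer every name sits at the start of the value, so both calls give the same classes there; they diverge as soon as a name appears later in the string.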

Timings:

df = pd.DataFrame({'maker_nm':['Sam 1','Joe 5','Paul 7','Mike 0']})

df = pd.concat([df] * 1000, ignore_index=True)

#print (df)

In [117]: %%timeit
     ...: df['maker_grp'] = np.nan
     ...: for key in df[df['maker_nm'].str.contains("Sam|Mike")].index:
     ...:     df['maker_grp'][key] = 'Class1'
     ...: for key in df[df['maker_nm'].str.contains("Andy|John|Paul|Jay")].index:
     ...:     df['maker_grp'][key] = 'Class2'
     ...: df['maker_grp'] = df.maker_grp.fillna('Class3')
     ...: 

In [118]: %%timeit
     ...: m1 = df['maker_nm'].str.contains("Sam|Mike")
     ...: m2 = df['maker_nm'].str.contains("Andy|John|Paul|Jay")
     ...: 
     ...: df['maker_grp'] = np.select([m1,m2], ['Class1','Class2'], default='Class3')
     ...: 
100 loops, best of 3: 5.98 ms per loop

In [119]: %%timeit
     ...: df['maker_grp'] = df['maker_nm'].apply(f)
     ...: 
100 loops, best of 3: 7.38 ms per loop

Caveat:

Performance really depends on the data and on the number of conditions.
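Because of that, it is probably best to measure on your own data. The following is a rough sketch using the standard timeit module outside IPython; the helper names with_select, with_apply and classify are hypothetical, and the numbers it prints will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
df = pd.concat([df] * 1000, ignore_index=True)

def with_select(frame):
    # Vectorised regex masks combined with np.select.
    m1 = frame['maker_nm'].str.contains("Sam|Mike")
    m2 = frame['maker_nm'].str.contains("Andy|John|Paul|Jay")
    return np.select([m1, m2], ['Class1', 'Class2'], default='Class3')

def classify(x):
    # Plain substring checks, evaluated row by row.
    if 'Sam' in x or 'Mike' in x:
        return 'Class1'
    if any(name in x for name in ('Andy', 'John', 'Paul', 'Jay')):
        return 'Class2'
    return 'Class3'

def with_apply(frame):
    return frame['maker_nm'].apply(classify)

# Sanity check: both approaches should produce the same labels.
same = with_select(df).tolist() == with_apply(df).tolist()

for fn in (with_select, with_apply):
    best = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```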

EDIT: With many conditions, plain substring checks applied through a function are faster:

m1 = df['maker_nm'].str.contains("Sam", regex=False)
m2 = df['maker_nm'].str.contains("Mike", regex=False)
m3 = df['maker_nm'].str.contains("Andy", regex=False)
m4 = df['maker_nm'].str.contains("John", regex=False)
m5 = df['maker_nm'].str.contains("Paul", regex=False)
m6 = df['maker_nm'].str.contains("Jay", regex=False)

df['maker_grp'] = np.select([m1,m2,m3,m4,m5,m6], ['Class1','Class1','Class2','Class2','Class2','Class2'], default='Class3')
print (df)

def f(x):
    if 'Sam' in x:
        return 'Class1'
    elif 'Mike' in x:
        return 'Class1'
    elif 'Andy' in x:
        return 'Class2'
    elif 'John' in x:
        return 'Class2'
    elif 'Paul' in x:
        return 'Class2'
    elif 'Jay' in x:
        return 'Class2'  
    else:
        return 'Class3'

df['maker_grp'] = df['maker_nm'].apply(f)
print (df)

In [133]: %%timeit
     ...: m1 = df['maker_nm'].str.contains("Sam", regex=False)
     ...: m2 = df['maker_nm'].str.contains("Mike", regex=False)
     ...: m3 = df['maker_nm'].str.contains("Andy", regex=False)
     ...: m4 = df['maker_nm'].str.contains("John", regex=False)
     ...: m5 = df['maker_nm'].str.contains("Jay", regex=False)
     ...: 
     ...: df['maker_grp'] = np.select([m1,m2,m3,m4,m5], ['Class1','Class1', 'Class2','Class2','Class2'], default='Class3')
     ...: 
100 loops, best of 3: 5.79 ms per loop

In [134]: %%timeit
     ...: df['maker_grp'] = df['maker_nm'].apply(f)
     ...: 
1000 loops, best of 3: 1.41 ms per loop
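If the list of names keeps growing, the chain of elif branches could be replaced by a lookup table. This is only a sketch of one possible layout: the RULES dictionary and the classify function are hypothetical names, not part of the original answer, and the speed of this variant has not been measured here.

```python
import pandas as pd

# Hypothetical lookup table: substring -> class label.
# Entries are checked in insertion order, so earlier entries win.
RULES = {
    'Sam': 'Class1', 'Mike': 'Class1',
    'Andy': 'Class2', 'John': 'Class2', 'Paul': 'Class2', 'Jay': 'Class2',
}

def classify(name, rules=RULES, default='Class3'):
    # Return the label of the first substring found in the name.
    for needle, label in rules.items():
        if needle in name:
            return label
    return default

df = pd.DataFrame({'maker_nm': ['Sam 1', 'Joe 5', 'Paul 7', 'Mike 0']})
df['maker_grp'] = df['maker_nm'].apply(classify)
print(df)
```

Adding a new name then means adding one dictionary entry rather than another branch.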

This answer comes from a similar question on Stack Overflow about "Python Pandas : classifying values in column and making a new column": https://stackoverflow.com/questions/48681310/
