python - Replace a Pandas column with a sorted index

Tags: python pandas pandas-groupby

I have a sample DF and am trying to replace the values of some columns with an index derived from an ascending sort:

DF:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

    a   b   c    d1      d2       date
0   2   7   9   Apple   Orange  2002-01-01
1   6   0   9   Mango   lemon   2002-01-01
2   8   0   0   Apple   lemon   2002-01-01
3   4   4   4   Mango   Orange  2002-01-01
4   5   0   8   Mango   lemon   2002-02-01
5   6   1   6   Mango   Orange  2002-02-01
6   7   2   7   Apple   lemon   2002-02-01

Step 1:
Group the DF by the "date" column. A sample group for "2002-01-01":

        a   b   c    d1      d2       date
    0   2   7   9   Apple   Orange  2002-01-01
    1   6   0   9   Mango   lemon   2002-01-01
    2   8   0   0   Apple   lemon   2002-01-01
    3   4   4   4   Mango   Orange  2002-01-01

Step 2:

Within that group, replace the values of columns ["d1","d2"] with an index (not the DF index) given by the ascending sort order of the mean of c.

For example, in the group above:

mean(c, d1="Apple") = (9+0)/2 = 4.5
mean(c, d1="Mango") = (9+4)/2 = 6.5

So the ascending sorted index is Apple: 0, Mango: 1, and the values of column d1 are replaced as follows:
            a   b   c   d1       d2       date
        0   2   7   9   0      Orange   2002-01-01
        1   6   0   9   1      lemon    2002-01-01
        2   8   0   0   0      lemon    2002-01-01
        3   4   4   4   1      Orange   2002-01-01
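
The arithmetic for this sample group can be checked with a small sketch. It assumes only the c and d1 values shown in the table above; the column name d1_index is introduced here for illustration and is not part of the original question.

```python
import pandas as pd

# The sample group for 2002-01-01, copied from the table above.
group = pd.DataFrame({
    "c": [9, 9, 0, 4],
    "d1": ["Apple", "Mango", "Apple", "Mango"],
})

# Mean of c for each d1 value within the group.
means = group.groupby("d1")["c"].mean()

# Dense rank of the means in ascending order, shifted to start at 0.
codes = (means.rank(method="dense", ascending=True) - 1).astype(int)

# Map each label to its position (d1_index is an illustrative name).
group["d1_index"] = group["d1"].map(codes)

print(codes.to_dict())             # {'Apple': 0, 'Mango': 1}
print(group["d1_index"].tolist())  # [0, 1, 0, 1]
```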

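Apply this to the whole df. I have a brute-force approach that iterates over the groups and over every row; any suggestions for a more pandas-based solution would help improve efficiency.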

Best answer

Is this what you are looking for in the d1 column? You can apply a similar technique to d2. It is not the most elegant solution, though.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

# Mean of c for each (date, d1) pair, broadcast back to every row.
df['mean_value'] = df.groupby(['date', 'd1'])['c'].transform('mean')
# Dense rank of those means within each date, shifted to start at 0.
df['rank_value'] = (df.groupby('date')['mean_value'].rank(ascending=True, method='dense') - 1).astype(int)
df['d1'] = df['rank_value']
# Drop the helper columns.
df = df.drop(columns=['rank_value', 'mean_value'])

df
   a  b  c  d1      d2       date
0  3  1  4   1  Orange 2002-01-01
1  9  7  5   0   lemon 2002-01-01
2  9  9  5   1   lemon 2002-01-01
3  8  1  2   0  Orange 2002-01-01
4  8  0  1   0   lemon 2002-02-01
5  1  8  3   0  Orange 2002-02-01
6  8  0  4   1   lemon 2002-02-01
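
The answer mentions that a similar technique applies to d2. One possible way to cover both columns is sketched below, using the fixed values from the question's table rather than random data so the result is reproducible. The helper name rank_by_group_mean and its parameters are hypothetical, introduced only for this illustration. Note that in the 2002-02-01 group Apple and Mango both have a mean of 7, so with method='dense' they share index 0; the question does not say how ties should be handled, so treat that as an assumption.

```python
import pandas as pd

# Values copied from the question's example table (not random).
df = pd.DataFrame({
    "a": [2, 6, 8, 4, 5, 6, 7],
    "b": [7, 0, 0, 4, 0, 1, 2],
    "c": [9, 9, 0, 4, 8, 6, 7],
    "d1": ["Apple", "Mango", "Apple", "Mango", "Mango", "Mango", "Apple"],
    "d2": ["Orange", "lemon", "lemon", "Orange", "lemon", "Orange", "lemon"],
    "date": pd.to_datetime(["2002-01-01"] * 4 + ["2002-02-01"] * 3),
})


def rank_by_group_mean(frame, col, value_col="c", group_col="date"):
    """Hypothetical helper: 0-based dense rank of mean(value_col) per label of col, within each group_col."""
    # Mean of value_col for each (group_col, col) pair, aligned to the rows.
    means = frame.groupby([group_col, col])[value_col].transform("mean")
    # Rank those means within each group; equal means share a rank.
    ranks = means.groupby(frame[group_col]).rank(method="dense", ascending=True)
    return (ranks - 1).astype(int)


# Compute every replacement before assigning, so each one sees the original labels.
new_cols = {col: rank_by_group_mean(df, col) for col in ["d1", "d2"]}
df = df.assign(**new_cols)

print(df)
```

For this data, d1 becomes [0, 1, 0, 1, 0, 0, 0] and d2 becomes [1, 0, 0, 1, 1, 0, 1].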

Regarding "python - Replace a Pandas column with a sorted index", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62479547/
