python - Pandas groupby - 对每组中的一半记录应用不同的函数

标签 python pandas group-by

我有类似以下数据框的内容,其中有街道地址范围和街道名称的非唯一组合。

import pandas as pd
df=pd.DataFrame()
df['BlockRange']=['100-150','100-150','100-150','100-150','200-300','200-300','300-400','300-400','300-400']
df['Street']=['Main','Main','Main','Main','Spruce','Spruce','2nd','2nd','2nd']
df
  BlockRange  Street
0    100-150    Main
1    100-150    Main
2    100-150    Main
3    100-150    Main
4    200-300  Spruce
5    200-300  Spruce
6    300-400     2nd
7    300-400     2nd
8    300-400     2nd

在 3 个“组”中的每一个 - (100-150, Main)、(200-300, Spruce) 和 (300-400, 2nd) - 我希望每个组中的一半记录获得一个 block 等于街区范围中点的数字和一半的记录得到一个等于街区范围中点加 1 的街区编号(就像把它放在街道的另一边)。

我知道这应该可以使用 groupby 转换来完成,但我不知道该怎么做(我在将函数应用于 groupby 键“BlockRange”时遇到问题)。

我只能通过遍历每个独特的组来获得我正在寻找的结果,这在我的完整数据集上运行时需要一段时间。请参阅下面我当前的解决方案和我正在寻找的最终结果:

groups=df.groupby(['BlockRange','Street'])

#Write function that calculates the mid point of the block range
def get_mid(x):
    block_nums=[int(y) for y in x.split('-')]
    return sum(block_nums)/len(block_nums)

final=pd.DataFrame()
for groupkey,group in groups:
    block_mid=get_mid(groupkey[0])
    halfway_point=len(group)/2
    group['Block']=0
    group.iloc[:halfway_point]['Block']=block_mid
    group.iloc[halfway_point:]['Block']=block_mid+1
    final=final.append(group)

final
  BlockRange  Street  Block
0    100-150    Main    125
1    100-150    Main    125
2    100-150    Main    126
3    100-150    Main    126
4    200-300  Spruce    250
5    200-300  Spruce    251
6    300-400     2nd    350
7    300-400     2nd    351
8    300-400     2nd    351

关于如何更有效地执行此操作,有什么建议吗?也许使用 groupby 转换?

最佳答案

您可以使用 apply使用自定义函数 f:

def f(x):
    df = pd.DataFrame([y.split('-') for y in x['BlockRange'].tolist()])
    df = df.astype(int)
    block_nums = df.sum(axis=1) / 2
    x['Block'] = block_nums[0]
    halfway_point=len(x)/2
    x.iloc[halfway_point:, 2] = block_nums[0] + 1
    return x

print df.groupby(['BlockRange','Street']).apply(f)

  BlockRange  Street  Block
0    100-150    Main    125
1    100-150    Main    125
2    100-150    Main    126
3    100-150    Main    126
4    200-300  Spruce    250
5    200-300  Spruce    251
6    300-400     2nd    350
7    300-400     2nd    351
8    300-400     2nd    351  

时间:

In [32]: %timeit orig(df)
__main__:26: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
__main__:27: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
__main__:28: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
1 loops, best of 3: 290 ms per loop

In [33]: %timeit new(df)
100 loops, best of 3: 10.2 ms per loop  

测试:

print df
df1 = df.copy()

def orig(df):
    groups=df.groupby(['BlockRange','Street'])

    #Write function that calculates the mid point of the block range
    def get_mid(x):
        block_nums=[int(y) for y in x.split('-')]
        return sum(block_nums)/len(block_nums)
    final=pd.DataFrame()

    for groupkey,group in groups:
        block_mid=get_mid(groupkey[0])
        halfway_point=len(group)/2
        group['Block']=0
        group.iloc[:halfway_point]['Block']=block_mid
        group.iloc[halfway_point:]['Block']=block_mid+1
        final=final.append(group)
    return final    

def new(df):
    def f(x):
        df = pd.DataFrame([y.split('-') for y in x['BlockRange'].tolist() ])
        df = df.astype(int)
        block_nums = df.sum(axis=1) / 2
        x['Block'] = block_nums[0]
        halfway_point=len(x)/2
        x.iloc[halfway_point:, 2] = block_nums[0] + 1
        return x

    return df.groupby(['BlockRange','Street']).apply(f)

print orig(df)
print new(df1)   

关于python - Pandas groupby - 对每组中的一半记录应用不同的函数,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35298439/

相关文章:

python - 套接字错误 "[Errno 9] Bad file descriptor"可能是什么原因

python - 用不同版本的 Pandas 读泡菜

python - 分组依据 占总数的百分比

mysql - mysql中按年、月分组时出错

python - 如何在 Ansible 2.7 或 2.8 中循环遍历多级 dict 或 yaml?

python - 从数据列表中创建矩阵列表

Python:如何使用 networkx 创建与另一个数据框对应的图形?

mysql - 使用 GROUP BY 从 MySql 数据库中获取第二高的值

python - 如何杀死这个threading.Timer?

python-2.7 - 如何按列值对数据框进行排序?