python - Improving the performance of a ranking function by replacing lambda x with vectorization

Tags: python pandas lambda vectorization ranking

I have a ranking function that I apply to a large number of columns across several million rows, and it takes minutes to run. By removing all of the logic that prepares the data for the call to .rank(), i.e. by doing this:

ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))        

I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down as well (see below). I have provided a sample data frame, together with my ranking functions, below, i.e. an MCVE. Broadly, I think my questions boil down to:

(i) How can I replace the .apply(lambda x usage in this code with a fast, vectorized equivalent?
(ii) How can one loop over a multi-indexed, grouped data frame and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps most of this logic can be done on df directly, e.g. by building temporary columns, before sending it off for ranking. Similarly, can the sub-dataframe be ranked in a single call?
(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to handle ties more flexibly, but I cannot find a comparison between the two, and pd.qcut() seems to be the most widely used. (A small tie-handling sketch follows below.)
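To make the tie-handling difference in (iv) concrete, here is a minimal sketch (the toy Series is my own illustration, not part of the original data): .rank() gives tied values the average position by default, while pd.qcut() computes quantile bin edges and raises when ties make those edges collide, unless duplicates='drop' is passed (available in reasonably recent pandas):

import pandas as pd

s = pd.Series([1.0, 1.0, 1.0, 2.0, 3.0, 4.0])

# rank: the three tied 1.0s share the average position; pct=True gives a percentile directly
print(s.rank(pct=True))                 # 0.333 for each 1.0, then 0.667, 0.833, 1.0

# qcut: the quantile bin edges collide on the ties, so it raises by default
try:
    print(pd.qcut(s, 4, labels=False))
except ValueError as e:
    print("qcut failed:", e)            # "Bin edges must be unique..."

# dropping duplicate edges merges the colliding bins instead of raising
print(pd.qcut(s, 4, labels=False, duplicates='drop'))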

The sample input data is as follows:
import pandas as pd
import numpy as np
import random

to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')

The two ranking functions are:
def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    #extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked


def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna() # 
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50

    if len(sub_df.index) == 0: #check not all nan.  If no non-nan data, then return with rank 50
        return nans_ranked

    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df

    #Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster 

    if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100

    if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df

    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)

    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) #currently using pd.qcut; .rank( seems to have extra functionality, but overheads are similar in practice
    sub_df_ranked *= (100 / bins) #Since we bin using the resolution specified in bins, to convert back to a percentile rank we have to multiply by 100/bins.  E.g. with quintiles, we'll have scores 0 - 4, so have to multiply by 100 / 5 = 20 to convert to a percentile ranking
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df

The code that calls my ranking function and recombines the result with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]

ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank)) 
ranked.columns = ranked_cols        
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)    
df = df.join(ranked[ranked_cols])

I am trying to get this ranking logic as fast as possible by removing both lambda x calls; I can strip the logic out of rank_fun so that only f(x)'s logic applies, but I also don't know how to process multi-indexed data frames in a vectorized manner. A further question concerns the differences between pd.qcut( and df.rank(: it seems that the two handle ties differently, yet the overheads seem similar in practice, even though .rank( is cythonized; this may be misleading, given that the main overhead is due to my use of lambda x.
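For reference, the plain percentile rank on its own (without the qcut/binning and tie-cluster logic) can be pushed entirely into the cythonized groupby machinery; a minimal sketch, assuming a pandas version where GroupBy.rank supports pct= (note it computes rank/n rather than the (rank - 1)/len(x) used in the one-liner above):

# cythonized per-group percentile rank, no Python-level lambda
pct_ranked = df.groupby(['date_id', 'category'])[to_rank].rank(ascending=True, pct=True) * 100
# NaNs stay NaN here; fill with 50 to match the convention used in f(x)
df[[col + '_ranked' for col in to_rank]] = pct_ranked.fillna(50)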

I ran %lprun f(x), which gave me the following results, although the main overhead is the use of .apply(lambda x rather than a vectorized approach:

Line #      Hits         Time  Per Hit   % Time  Line Contents
 2                                           def tst_fun(df, field):
 3         1          685    685.0      0.2      x = df[field]
 4         1        20726  20726.0      5.8      nans = x[np.isnan(x)]
 5         1        28448  28448.0      8.0      sub_df = x.dropna()
 6         1          387    387.0      0.1      nans_ranked = nans.replace(np.nan, 50)
 7         1            5      5.0      0.0      if len(sub_df.index) == 0: 
 8                                                   pass #check not empty.  May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
 9                                                   return nans_ranked
10                                           
11         1        65559  65559.0     18.4      if len(sub_df.unique()) == 1: 
12                                                   sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13                                                   return sub_df
14                                           
15                                               #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16         1        74610  74610.0     20.9      max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17                                               # print(counts)
18         1            9      9.0      0.0      max_bins = len(sub_df) / max_cluster #
19                                           
20         1            3      3.0      0.0      if max_bins > 100: 
21         1            0      0.0      0.0          max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22                                           
23                                           
24         1            0      0.0      0.0      if max_bins < 5: 
25                                                   sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26                                           
27                                               #     return sub_df
28                                           
29         1            1      1.0      0.0      bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30                                           
31                                               #should track bin resolution for all data.  To add.
32                                           
33                                               #if get here, then neither nans_ranked, nor sub_df are empty
34                                               # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35         1       160530 160530.0     45.0      sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36                                           
37         1         5777   5777.0      1.6      ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38                                               
39         1            1      1.0      0.0      return ranked_df

Best answer

I suggest you try this code. It is about 3x faster than yours, and much clearer.

The ranking function:

def rank(x):
    counts = x.value_counts()
    # highest usable bin count: non-nan observations divided by the size of the largest tie cluster
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins  # cap at percentile resolution
    if bins < 5:
        # not enough resolution even for a quintile rank: give everything the neutral rank of 50
        return x.apply(lambda x: 50)
    else:
        # bin into `bins` quantiles, rescale to 0-100, and rank nans as 50
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
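A quick sanity check of what the rank function above returns (the toy Series is my own illustration): ten distinct values get spread over 0, 10, ..., 90, and the NaN is filled with 50.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(rank(s))   # 50 for the NaN, then 0, 10, ..., 90 for the ordered values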

Single-threaded apply:
for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)

Multiprocessing apply:
import sys
from multiprocessing import Pool

def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)

pool = Pool(len(to_rank))
result = pool.map_async(tfunc, to_rank).get(sys.maxsize)  # sys.maxint on Python 2
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val
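One caveat I would add (not part of the original answer): on platforms where multiprocessing spawns workers rather than forking (Windows, and macOS on recent Python versions), the Pool creation and the map call need to sit under a main guard, and tfunc/df must be defined at module level so the workers can import them, roughly:

if __name__ == '__main__':
    pool = Pool(len(to_rank))
    result = pool.map_async(tfunc, to_rank).get(sys.maxsize)
    for (col, val) in zip(to_rank, result):
        df[col + '_ranked'] = val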

Regarding python - improving the performance of a ranking function by replacing lambda x with vectorization, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44030936/
