python - Speed up the calculation of nearby groups?

Tags: python, performance, pandas, numpy, search

I have a dataframe with a group ID, two distance measures (longitude/latitude-type measures), and a value. For a given set of distances, I want to find the number of other groups nearby and the average value of those nearby groups.

I have written the code below, but it is so inefficient that it simply does not finish in a reasonable time on very large datasets. Counting the nearby retailers is fast, but computing the average value of the nearby retailers is very slow. Is there a better way to make this more efficient?

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

distances = [1,2]

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)),
                  columns=['Group','Dist1','Dist2','Value'])

# get one row per group, with the two distances for each row
df_groups = df.groupby('Group')[['Dist1','Dist2']].mean()

# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])

# find points within a given radius
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)

    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]

    # get average values of nearby groups
    for idx, val in enumerate(df_groups.index):
        val_idx = df_groups.iloc[closeby[idx]].index.values
        mean = df.loc[df['Group'].isin(val_idx), 'Value'].mean()
        df_groups.loc[val, str(i) + '_mean_values'] = mean

    # merge back to dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles', 
                                 str(i) + '_mean_values']], 
                  left_on='Group', 
                  right_index=True)
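
For reference, query_ball_tree queried against itself returns one list of neighbour indices per group, and each list includes the group's own index, so the counts and means above include the group itself. A tiny standalone sketch with made-up points:

import numpy as np
from scipy.spatial import cKDTree

pts = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
small_tree = cKDTree(pts)

# each inner list holds the indices of the points within r of that point,
# always including the point's own index
print(small_tree.query_ball_tree(small_tree, r=1))   # [[0, 1], [0, 1], [2]]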

Accepted answer

The problem is clearly the indexing of the main dataframe with the isin method. As the dataframe grows in length, a much larger search has to be done. I suggest doing the same search on the much smaller df_groups dataframe and computing an updated (pooled) mean from it instead.

df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)),
                  columns=['Group','Dist1','Dist2','Value'])
distances = [1,2]
# per-group means of the distances, plus the mean and count of Value
df_groups = df.groupby('Group')[['Dist1','Dist2','Value']].agg(
    {'Dist1': 'mean', 'Dist2': 'mean', 'Value': ['mean', 'count']})

# flatten the MultiIndex columns
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

# rename columns
df_groups.rename(columns={'Dist1 mean': 'Dist1', 'Dist2 mean': 'Dist2',
                          'Value mean': 'Value', 'Value count': 'Count'},
                 inplace=True)


# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    # column holding the labels of the neighbouring groups for each group
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]

    # per-group sum of values (mean * count), used for the updated mean
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']

    # updated mean over each neighbourhood: sum(m*n) / sum(n)
    df_groups[str(i) + '_mean_values'] = [
        df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
        df_groups.loc[df_groups.index.isin(row), 'Count'].sum()
        for row in df_groups['subs']]
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)

The formula for the updated (pooled) mean is (m1*n1 + m2*n2) / (n1 + n2), where m is each group's mean and n its count.
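
As a quick sanity check with made-up numbers, the pooled mean matches the mean taken over the combined raw values:

import numpy as np

a = np.array([1, 2, 3])    # group 1: m1 = 2.0, n1 = 3
b = np.array([10, 20])     # group 2: m2 = 15.0, n2 = 2

m1, n1 = a.mean(), len(a)
m2, n2 = b.mean(), len(b)

pooled = (m1 * n1 + m2 * n2) / (n1 + n2)
assert pooled == np.concatenate([a, b]).mean()   # both give 7.2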

Old setup

100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop

1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop

10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop

New setup

100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop

1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop

10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
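
The old and new names above are assumed to be wrapper functions around the question's code and this answer's code, each taking the raw dataframe; a rough sketch of such a timing run (everything here besides the %timeit lines is an assumption):

import numpy as np
import pandas as pd

# hypothetical harness: old() wraps the question's approach,
# new() wraps the answer's approach
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])

%timeit old(df)
%timeit new(df)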

Regarding "python - Speed up the calculation of nearby groups?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45373501/
