python - Encapsulating vectorized functions for use with Pandas DataFrames

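I've been refactoring some code and using it to explore how to structure maintainable, flexible, concise code when using Pandas and Numpy. (Usually I only use them briefly; I'm now in a role where I should be aiming to become an expert.)

One example I've come across is a function that is sometimes called on one column of values and sometimes on three columns of values. Vectorized code using Numpy encapsulates it perfectly, but using it is a bit clunky.

How should I write the following function "better"?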

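import numpy as np


def project_unit_space_to_index_space(v, vertices_per_edge):
    # Rescale v from [-1, 1] to [0, vertices_per_edge - 1], then round to integers.
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Stack the three columns into a single (3, n_rows) array.
# (Named xyz rather than input to avoid shadowing the builtin.)
xyz = np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0)

index_space = project_unit_space_to_index_space(xyz, 42)

magic_space = some_other_transformation_code(index_space, foo, bar)

# Unpack the result row by row into new columns.
df['x_'], df['y_'], df['z_'] = magic_space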

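As written, the function accepts either one column of data or several columns of data, and it still works correctly and quickly.

The return value has the right shape to pass directly into another similarly structured function, which lets me chain functions neatly.

Even assigning the results back to new columns in the dataframe isn't "awful", though it is a little clunky.

But packaging the inputs up as a single np.ndarray is very, very clunky indeed.

I haven't found any style guides that cover this. They're all about iterrows and lambda expressions, etc. I haven't found anything on best practices for encapsulating this kind of logic.

So, how would you structure the above code?

EDIT: Timings of various options for collating the inputs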

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].unstack().to_numpy())                      
# 1.44 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].to_numpy().T)                              
# 558 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].transpose().to_numpy())                    
# 817 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0))   
# 3.46 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
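One thing worth checking when comparing the timings above is that the four expressions do not all produce the same shape. The sketch below uses a small made-up DataFrame (the data and the variable names are assumptions for illustration only) to print the shapes side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only to illustrate the shapes.
df = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 0.3],
    "y": [1.0, 1.1, 1.2, 1.3],
    "z": [2.0, 2.1, 2.2, 2.3],
})
cols = ["x", "y", "z"]

via_T = df[cols].to_numpy().T                    # (3, 4)
via_transpose = df[cols].transpose().to_numpy()  # (3, 4)
via_concat = np.concatenate(([df["x"]], [df["y"]], [df["z"]]), axis=0)  # (3, 4)

# DataFrame.unstack() on a single-level index returns a Series,
# so this is a flat array of length 12 rather than a (3, 4) array.
via_unstack = df[cols].unstack().to_numpy()

print(via_T.shape, via_transpose.shape, via_concat.shape, via_unstack.shape)
```

Because the function is elementwise, it works on the flat array too, but the result would need a `reshape(3, -1)` before it could be unpacked into three columns.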

Best answer

In [101]: df = pd.DataFrame(np.arange(12).reshape(4,3))                         
In [102]: df                                                                    
Out[102]: 
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

You are creating an (n,m) array from n columns of the dataframe:

In [103]: np.concatenate([[df[0]],[df[1]],[df[2]]],0)                           
Out[103]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

A more compact way is to transpose the array of those columns:

In [104]: df.to_numpy().T                                                       
Out[104]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

The dataframe has its own transpose:

In [109]: df.transpose().to_numpy()                                             
Out[109]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
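Either transposed array can be passed to the question's function in place of the `np.concatenate` call. The following is only a sketch of that round trip: the sample values and `vertices_per_edge=5` are made up so that the arithmetic is easy to follow, and `DataFrame.assign` is just one possible way to write the results back.

```python
import numpy as np
import pandas as pd


def project_unit_space_to_index_space(v, vertices_per_edge):
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Hypothetical sample data in the range [-1, 1].
df = pd.DataFrame({
    "x": [-1.0, 0.0, 1.0],
    "y": [-0.5, 0.5, 1.0],
    "z": [0.0, 0.0, -1.0],
})

xyz = df[["x", "y", "z"]].to_numpy().T  # shape (3, n_rows)
index_space = project_unit_space_to_index_space(xyz, 5)

# Each row of the (3, n_rows) result becomes one new column.
df = df.assign(x_=index_space[0], y_=index_space[1], z_=index_space[2])
print(df)
```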

Your calculation also works directly on a dataframe, returning a dataframe with the same shape and index:

In [113]: np.rint((df+1)/2 *(42-1)).astype(int)                                 
Out[113]: 
     0    1    2
0   20   41   62
1   82  102  123
2  144  164  184
3  205  226  246

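Some numpy functions convert their inputs to numpy arrays and return an array. Others, by delegating the details to pandas methods, can work directly on a dataframe and return a dataframe.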
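Since `np.rint` is one of the functions that hands a dataframe back, the question's function can be called on a column selection without any packing step. This is a minimal sketch under assumptions: the sample data, `vertices_per_edge=5`, and the use of `add_suffix` plus `join` to attach the new columns are illustrative choices, not something taken from the original post.

```python
import numpy as np
import pandas as pd


def project_unit_space_to_index_space(v, vertices_per_edge):
    # The same body handles ndarrays, Series and DataFrames.
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Hypothetical sample data in the range [-1, 1].
df = pd.DataFrame({
    "x": [-1.0, 0.0, 1.0],
    "y": [-0.5, 0.5, 1.0],
    "z": [0.0, 0.0, -1.0],
})

# Three columns in, a DataFrame with the same index and columns out.
projected = project_unit_space_to_index_space(df[["x", "y", "z"]], 5)

# One column in, a Series out.
single = project_unit_space_to_index_space(df["x"], 5)

# Attach the results as x_, y_, z_ alongside the originals.
result = df.join(projected.add_suffix("_"))
print(result)
```

Whether this fits the full pipeline depends on the later steps: if a downstream function such as `some_other_transformation_code` needs a plain (3, n_rows) array, the `.to_numpy().T` form is probably the simpler choice.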

Regarding python - Encapsulating vectorized functions for use with Pandas DataFrames, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62233061/
