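python - Resize a NumPy array of lists so that all lists have the same length and the dtype of the NumPy array can be inferred correctly

I currently have the following DataFrame: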

import numpy as np
import pandas as pd

data = {'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']],
        'col_b': [[1, 3], [1, 0, 0], [4], [1, 1, 2, 0], [0, 0, 5], [3, 1, 2, 5]]}
df = pd.DataFrame(data)

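Say I am working with col_a. I want to resize the lists in col_a in a vectorized way so that every sublist has the same length as the longest list, and for col_a I want to fill the empty slots with 'None'. I want the final output to look like this: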

                   col_a               col_b
0     [a, b, None, None]    [1, 3, nan, nan]
1        [a, b, c, None]      [1, 0, 0, nan]
2  [a, None, None, None]  [4, nan, nan, nan]
3           [a, b, c, d]        [1, 1, 2, 0]
4        [a, b, c, None]      [0, 0, 5, nan]
5           [a, b, c, d]        [3, 1, 2, 5]

So far I have done the following:

# Convert the column to a NumPy array with object dtype
col_np = df['col_a'].to_numpy()

# Find the maximum length of the lists using NumPy operations
max_length = np.max(np.frompyfunc(len, 1, 1)(col_np))

# Create a mask for padding
mask = np.arange(max_length) < np.frompyfunc(len, 1, 1)(col_np)[:, None]

# Pad the lists with None where necessary
result = np.where(mask, col_np, 'None')

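This results in the following error: ValueError: operands could not be broadcast together with shapes (6,4) (6,) ()

I feel like I am close, but I am missing something. Note that only a vectorized solution will be marked as the answer.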

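Best answer

"Only a vectorized solution will be marked as the answer." -> That is too bad, because a (truly) vectorized approach is not possible with an array of lists. In that sense, np.frompyfunc is certainly not truly vectorized either.

If by "vectorized" you mean "without an explicit Python loop", you can use: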
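As a side note on the attempt in the question: the broadcast error comes from passing the 1-D object array col_np (shape (6,)) to np.where next to a (6, 4) mask. Below is a minimal sketch of a mask-based fill that avoids the error. It assumes the padded result is wanted as a 2-D object array, the names `lengths`, `out` and `error_message` are illustrative, and it still calls len on every list at the Python level, so it is not truly vectorized either.

```python
import numpy as np
import pandas as pd

# Same col_a data as in the question, as a 1-D object array of lists
col_np = pd.Series([['a', 'b'], ['a', 'b', 'c'], ['a'],
                    ['a', 'b', 'c', 'd'], ['a', 'b', 'c'],
                    ['a', 'b', 'c', 'd']]).to_numpy()

# Per-list lengths; frompyfunc returns an object array, so cast to int
lengths = np.frompyfunc(len, 1, 1)(col_np).astype(int)

# True where a real element exists, False where padding is needed
mask = np.arange(lengths.max()) < lengths[:, None]   # shape (6, 4)

# The original call fails because (6, 4) and (6,) do not broadcast
try:
    np.where(mask, col_np, 'None')
    error_message = ''
except ValueError as exc:
    error_message = str(exc)

# Start from an all-None 2-D array and write the flattened values
# into the True positions of the mask (row-major order matches)
out = np.full(mask.shape, None, dtype=object)
out[mask] = np.concatenate(list(col_np))

print(error_message)
print(out)
```

The key difference from the question's code is that the lists are flattened first and then scattered into a pre-filled 2-D array, instead of asking np.where to broadcast a 1-D array of lists against a 2-D mask.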

df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
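The one-liner above is dense, so here is a sketch of the same chain split into steps. The intermediate names (`ragged`, `wide`, `padded`) are illustrative and not part of the original answer, and only col_a is included to keep it short.

```python
import pandas as pd

df = pd.DataFrame({'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'],
                             ['a', 'b', 'c', 'd'], ['a', 'b', 'c'],
                             ['a', 'b', 'c', 'd']]})

# Step 1: plain Python list of lists with unequal lengths
ragged = df['col_a'].to_numpy().tolist()

# Step 2: the DataFrame constructor makes one column per position and
# fills the missing cells of short rows with a missing-value marker
# (None in the output shown in this answer)
wide = pd.DataFrame(ragged)

# Step 3: back to a list of rows, now all of the same length
padded = wide.to_numpy().tolist()

# Step 4: store the rows as a new column
df['out_a'] = pd.Series(padded)
print(df)
```

Assigning a pd.Series aligns on the index, so this assumes df has the default 0..n-1 index; with a custom index, passing `index=df.index` to pd.Series would be the safer choice.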

An alternative with an explicit loop is:

size = df['col_a'].str.len().max()

df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

Output:

          col_a         col_b                  out_a
0        [a, b]        [1, 3]     [a, b, None, None]
1     [a, b, c]     [1, 0, 0]        [a, b, c, None]
2           [a]           [4]  [a, None, None, None]
3  [a, b, c, d]  [1, 1, 2, 0]           [a, b, c, d]
4     [a, b, c]     [0, 0, 5]        [a, b, c, None]
5  [a, b, c, d]  [3, 1, 2, 5]           [a, b, c, d]
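The question also asks for col_b to be padded with nan and for the dtype to be inferred correctly. A minimal sketch, assuming a rectangular 2-D array is an acceptable result: the same DataFrame round trip pads numeric rows with NaN, and to_numpy() then returns a float array. The name `arr_b` is illustrative.

```python
import numpy as np
import pandas as pd

col_b = [[1, 3], [1, 0, 0], [4], [1, 1, 2, 0], [0, 0, 5], [3, 1, 2, 5]]

# Short rows are padded with NaN, which forces a floating-point dtype
arr_b = pd.DataFrame(col_b).to_numpy()

print(arr_b.dtype)
print(arr_b)

# Optional: back to a column of equal-length lists
out_b = pd.Series(arr_b.tolist())
```

Because NaN is a float, the integers come back as floats (1.0 rather than 1); keeping them as integers would need a nullable integer dtype or an object array.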

Timings

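For small lists, the "vectorized" and loop solutions have very similar timings.

[plot: timings for lists of 1 to 10 items]

However, as the list size increases, the Python loop becomes the more efficient of the two.

[plot: timings for lists of 0 to 50 items]

[plot: timings for lists of 0 to 200 items]

[plot: timings for lists of 0 to 2000 items]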

Code used for the timings:

import pandas as pd
import perfplot
import numpy as np

def pandas_vectorized(df):
    df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
    
def python_loop(df):
    size = df['col_a'].str.len().max()
    df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

MAX_LIST_SIZE = 2000
    
perfplot.show(
    setup=lambda n: pd.DataFrame({'col_a': [['x']*n for n in np.random.randint(0, MAX_LIST_SIZE, size=n)]}),
    kernels=[pandas_vectorized, python_loop],
    n_range=[2**k for k in range(1, 18)],  # upper bound was 22 for small lists
    xlabel="len(df)",
    equality_check=None,
    max_time=10,
)
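For a quick check without perfplot, a rough sketch using only timeit from the standard library is shown below. The row count (1000), the list size limits and the repeat count are arbitrary assumptions, and the two functions return the padded lists instead of assigning a column, so absolute numbers will differ from the plots above.

```python
import timeit

import numpy as np
import pandas as pd


def pandas_vectorized(col):
    # Round trip through a DataFrame, which pads short rows
    return pd.DataFrame(col.to_numpy().tolist()).to_numpy().tolist()


def python_loop(col):
    # Explicit padding with None up to the longest list
    size = col.str.len().max()
    return [l + [None] * (size - len(l)) for l in col]


rng = np.random.default_rng(0)
results = {}

for max_size in (10, 200):
    # 1000 rows of lists with random lengths in [1, max_size)
    col = pd.Series([['x'] * k for k in rng.integers(1, max_size, size=1000)])
    for fn in (pandas_vectorized, python_loop):
        seconds = timeit.timeit(lambda: fn(col), number=5)
        results[(fn.__name__, max_size)] = seconds
        print(f"{fn.__name__:>18} max_size={max_size:<4} {seconds:.4f}s")
```

Running it on your own data sizes is the more reliable guide, since the crossover point depends on both the number of rows and the list lengths.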

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/76729250/
