python - Convert a Pandas dataframe into an array of fixed-size segments

Tags: python python-3.x pandas multidimensional-array numpy-ndarray

I am struggling to convert my dataframe into a set of fixed-size segments that I can feed to a convolutional neural network. Specifically, I want to convert the df into a list of m arrays, each holding one segment of shape (1,5,4), so that in the end I get an array of shape (m,1,5,4).

To clarify the problem, I will walk through it with this MWE. Suppose this is my df:

df = {
    'id': [1,1,1,1,1,1,1,1,1,1,1,1],
    'speed': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'acc': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'jerk': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'bearing': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label' : [3,3,3,3,3,3,3,3,3,3,3,3] }

df = pd.DataFrame.from_dict(df)

To do this, I use the following function:

def df_transformer(dataframe, chunk_size=5):
    
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over segments (id)
    for _, group in grouped:

        inputs = group.loc[:, 'speed':'bearing'].values
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = len(inputs) // chunk_size

        if N > 0:
            inputs = np.array_split(inputs, [chunk_size]*N)
        else:
            inputs = [inputs]
        
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

The df above has 12 rows, so if it were converted correctly to the expected form I should get an array of shape (3,1,5,4). In the function above, segments with fewer than 5 rows are zero-padded so that each segment ends up with shape (1,5,4) (a minimal sketch of this padding step is shown below).
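For illustration only, this is roughly what that padding step does to a hypothetical 2-row leftover segment (the values are the last two rows of the df; seg and padded are just illustrative names):

import numpy as np

# hypothetical leftover segment with only 2 rows, shape (2, 4)
seg = np.array([[0.75, -0.38, 0.65, 14.51],
                [0.37,  0.27, 0.52, 24.27]])

# pad zero rows at the bottom until the segment has chunk_size (= 5) rows
padded = np.pad(seg, [(0, 5 - len(seg)), (0, 0)], mode='constant')
print(padded.shape)                          # (5, 4)
print(padded[np.newaxis, np.newaxis].shape)  # (1, 1, 5, 4)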

At the moment I have two problems with this function:

  1. The function only works if I pass fewer than 10 rows of my df.

Like this (the last row is zero-padded below, as it should be):

X , y = df_transformer(df[:9])
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],


       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]]])

But in this case an all-zero array (segment) is introduced:

X , y = df_transformer(df[:10])
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],


       [[[ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]],


       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 4.600e-01,  2.900e-01, -6.700e-01,  1.076e+01]]]])
  2. The function fails if I pass the full df (I don't understand the error, but it seems to be related to the padding of segments with fewer than 5 rows).

So in this case I get an index can't contain negative values error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-1fc559db37eb> in <module>()
----> 1 X , y = df_transformer(df)

2 frames
<ipython-input-4-9e1c49985863> in df_transformer(dataframe, chunk_size)
     24             inpt = np.pad(
     25                 inpt, [(0, chunk_size-len(inpt)),(0, 0)],
---> 26                 mode='constant')
     27             # add each inputs split to accumulators
     28             X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)

<__array_function__ internals> in pad(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
    746 
    747     # Broadcast to shape (array.ndim, 2)
--> 748     pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
    749 
    750     if callable(mode):

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
    517 
    518     if as_index and x.min() < 0:
--> 519         raise ValueError("index can't contain negative values")
    520 
    521     # Converting the array with `tolist` seems to improve performance

ValueError: index can't contain negative values
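As a side note that may help pin down both symptoms: when np.array_split is given a list, NumPy treats the entries as split positions rather than chunk sizes, so [chunk_size]*N does not produce N equal chunks. The snippet below only illustrates that behavior, it is not a fix:

import numpy as np

a = np.arange(12).reshape(12, 1)

# [5, 5] means "split at index 5, then again at index 5",
# not "two chunks of 5 rows each"
parts = np.array_split(a, [5, 5])
print([len(p) for p in parts])  # [5, 0, 7]

The 0-row chunk is what gets padded into an all-zero segment, and the 7-row chunk is what makes chunk_size - len(inpt) negative.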

Expected output:

X , y = df_transformer(df)
X
array([[[[ 1.763e+01,  0.000e+00,  0.000e+00,  2.903e+01],
         [ 1.763e+01, -9.000e-02,  1.000e-02,  5.612e+01],
         [ 1.700e-01,  1.240e+00, -2.040e+00,  1.849e+01],
         [ 1.410e+00, -8.000e-01,  5.100e-01,  1.185e+01],
         [ 6.100e-01, -2.900e-01,  1.500e-01,  3.675e+01]]],

       [[[ 3.200e-01, -1.400e-01,  3.900e-01,  2.752e+01],
         [ 1.800e-01,  2.500e-01, -3.800e-01,  8.108e+01],
         [ 4.300e-01, -1.300e-01,  2.900e-01,  5.106e+01],
         [ 3.000e-01,  1.600e-01,  1.300e-01,  1.985e+01],
         [ 4.600e-01,  2.900e-01, -6.700e-01,  1.076e+01]]],

       [[[ 7.500e-01,  -3.800e-01,  6.500e-01,  1.451e+01],
         [ 3.700e-01,  2.700e-01,  5.200e-01,  2.427e+01],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00],
         [ 0.000e+00,  0.000e+00,  0.000e+00,  0.000e+00]]]])

Can someone help me fix this issue? The MWE above reproduces the error reliably.

EDIT

RichieV's answer also has a bug. Although it works on the given MWE, it does not do the right thing in the case below (the df extended to twice its size):

df = {
    'id': [1]*12+[2]*12,
    'speed': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37]*2,
    'acc': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27]*2,
    'jerk': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52]*2,
    'bearing': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27]*2,
    'label' : [3,3,3,3,3,3,3,3,3,3,3,3]*2 }
df = pd.DataFrame.from_dict(df)

X, y = df_transformer(df, chunk_size=5)
print(X[:3])

[[[[ 1.763e+01  0.000e+00  0.000e+00  2.903e+01]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 3.700e-01  2.700e-01  5.200e-01  2.427e+01]]]


 [[[ 7.500e-01 -3.800e-01  6.500e-01  1.451e+01]
   [ 3.000e-01  1.600e-01  1.300e-01  1.985e+01]
   [ 4.600e-01  2.900e-01 -6.700e-01  1.076e+01]
   [ 1.800e-01  2.500e-01 -3.800e-01  8.108e+01]
   [ 3.200e-01 -1.400e-01  3.900e-01  2.752e+01]]]


 [[[ 6.100e-01 -2.900e-01  1.500e-01  3.675e+01]
   [ 1.410e+00 -8.000e-01  5.100e-01  1.185e+01]
   [ 1.700e-01  1.240e+00 -2.040e+00  1.849e+01]
   [ 1.763e+01 -9.000e-02  1.000e-02  5.612e+01]
   [ 4.300e-01 -1.300e-01  2.900e-01  5.106e+01]]]]

Note that the first element differs from the one in the answer (rows 2, 3 and 4 are all zeros).

Best Answer

You can pad the df once up front instead of padding on every iteration.

Take this data with a second id:

df = {
    'id': [1,1,1,1,1,1,1,1,1,2,2,2],
    'speed': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'acc': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'jerk': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'bearing': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label' : [3,3,3,3,3,3,3,3,3,3,3,3] }
df = pd.DataFrame.from_dict(df)
print(df)

    id  speed   acc  jerk  bearing  label
0    1  17.63  0.00  0.00    29.03      3
1    1  17.63 -0.09  0.01    56.12      3
2    1   0.17  1.24 -2.04    18.49      3
3    1   1.41 -0.80  0.51    11.85      3
4    1   0.61 -0.29  0.15    36.75      3
5    1   0.32 -0.14  0.39    27.52      3
6    1   0.18  0.25 -0.38    81.08      3
7    1   0.43 -0.13  0.29    51.06      3
8    1   0.30  0.16  0.13    19.85      3
9    2   0.46  0.29 -0.67    10.76      3
10   2   0.75 -0.38  0.65    14.51      3
11   2   0.37  0.27  0.52    24.27      3

Code

def df_transformer(df, chunk_size=5):
    ### pad df with 0's so len(df) is exactly a multiple of chunk_size
    df = pd.concat([df,
        pd.DataFrame([[id] + [0] * 5 # add row with zeros
            for id, ct in df.groupby('id').size().iteritems() # for each id
            for row in range(chunk_size - ct % chunk_size)] # as many times as needed
            , columns=df.columns)
    ]).sort_values('id', kind='mergesort', ignore_index=True)
    # print(df)
    X, y = [], []
    for _, group in df.groupby(df.index//5):
        X.append(group.iloc[:, 1:-1].values[np.newaxis, ...])
        y.append(group.iloc[0, -1]) # not sure how you want y to be structured
    return np.array(X), np.array(y)


X, y = df_transformer(df, chunk_size=5)
print(X)

Output

[[[[ 1.763e+01  0.000e+00  0.000e+00  2.903e+01]
   [ 1.763e+01 -9.000e-02  1.000e-02  5.612e+01]
   [ 1.700e-01  1.240e+00 -2.040e+00  1.849e+01]
   [ 1.410e+00 -8.000e-01  5.100e-01  1.185e+01]
   [ 6.100e-01 -2.900e-01  1.500e-01  3.675e+01]]]

 [[[ 3.200e-01 -1.400e-01  3.900e-01  2.752e+01]
   [ 1.800e-01  2.500e-01 -3.800e-01  8.108e+01]
   [ 4.300e-01 -1.300e-01  2.900e-01  5.106e+01]
   [ 3.000e-01  1.600e-01  1.300e-01  1.985e+01]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]]]

 [[[ 4.600e-01  2.900e-01 -6.700e-01  1.076e+01]
   [ 7.500e-01 -3.800e-01  6.500e-01  1.451e+01]
   [ 3.700e-01  2.700e-01  5.200e-01  2.427e+01]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
   [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]]]]

Note how the first two segments come from id==1 and the last one comes from id==2, each with its own zero padding.
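As a quick sanity check (assuming the same two-id frame printed above), the resulting shapes would be:

print(X.shape)  # (3, 1, 5, 4) -> two segments from id==1, one from id==2
print(y.shape)  # (3,)         -> one label per segment
print(y)        # [3 3 3]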

The original question, python - Convert a Pandas dataframe into an array of fixed-size segments, is on Stack Overflow: https://stackoverflow.com/questions/63367761/
