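python - Creating sequences by group ID with numpy arrays

Tags: python arrays python-3.x pandas numpy

I wrote a function that builds sequences per group ID for an LSTM/GRU sequence model. It does not give me the expected output.

Python function: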

import numpy as np

def windowGeneratorByID(data, target, id_col_index, lookback, offset, batch_size=16):
  min_index=0
  max_index = data.shape[0]-offset
  i = min_index + lookback
  while 1:
    if i + batch_size >= max_index:
      i = min_index + lookback
    rows = np.arange(i, min(i + batch_size, max_index))
    i += len(rows)
    samples = np.zeros((len(rows), lookback, data.shape[-1]))
    targets = np.zeros((len(rows), target.shape[-1]))

    for j, row in enumerate(rows):
      indices = range(rows[j] - lookback, rows[j])
      if data[rows[j] + offset][id_col_index] in set(data[indices][:, id_col_index]):
        if len(set(data[indices][:, id_col_index])) == 1:
            samples[j] = data[indices]
            targets[j] = target[rows[j] + offset]

    yield  np.delete(samples,id_col_index,axis=2) , targets

Input:

df=np.array([[1,1,0.1,11],[1,2,0.2,12], [1,3,0.3,13], [1,4,0.4,14], [2,5,0.5,15], [2,6,0.6,16], [2,7,0.7,17],[3,8,0.8,18],[3,9,0.9,19],[3,10,0.7,20]])

Code that produces the output:

lookback=2
batch_size=2
offset = 0
windows = windowGeneratorByID(data=df, target=df[:,2:4],id_col_index=0 , offset=offset, lookback=lookback,batch_size=batch_size)

# The total number of batches equals (number of training examples - lookback - offset) / batch_size
no_batches=int((df.shape[0]-lookback-offset)/batch_size)

# Print the batches
for i in range(no_batches):
  #get the next batch from the windowGenerator
  input,output=next(windows)
  print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))

Expected output:

1th batch: 
input is:
[[[ 1.   0.1 11. ]
  [ 2.   0.2 12. ]]

 [[ 2.   0.2 12. ]
  [ 3.   0.3 13. ]]]
 and 
target is:
[[ 0.3 13. ]
 [ 0.4 14. ]]

2th batch: 
input is:
[[[ 5.   0.5 15. ]
  [ 6.   0.6 16. ]]

 [[ 8.   0.8 18. ]
  [ 9.   0.9 19. ]]]
 and 
target is:
[[ 0.7 17. ]
 [ 0.7 20. ]]

Best answer

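Here are two approaches to the problem you are trying to solve. The first is a generator approach like yours, which fetches one batch at a time. The second is a vectorized NumPy approach that operates on the complete data at once to produce all batches (it can also be applied to chunks of df rather than the whole array).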

Generator approach

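  1. A chunk, given an offset and a lookback, is basically a set of rows that maps X to y. So if I want lookback 2 and offset 1, I need 4 rows of df: the first 2 go to X and the last one goes to y. Similarly, if I need lookback 1 and offset 0, I only need 2 rows: the first goes to X and the last goes to y.
  2. With this understanding, I can compute the maximum number of chunks that a rolling window yields for each group and store it in c (in the code, c holds the rows per group and w = c - rows + 1 holds the chunk counts).
  3. Once I have this, I just need a function that iterates over the rows of df in a rolling fashion, taking that number of chunks and then skipping a few rows, because chunks starting at those rows would contain elements from a different group. So if I have [0,1,2,3,4,5,6], c = [2,1,1] and skip (that is, lookback+offset) = 1, I have to take 2, skip 1, take 1, skip 1, take 1, skip 1. So [0,1,3,5] are the indices I iterate over, and I fetch a chunk of the required size starting at each of them.
  4. The rest is simple. Set up a generator that extracts these chunks and, for batch size = n, extracts n chunks and stacks them before returning them.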
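
The take-and-skip indexing in step 3 can be checked in isolation. The following is a minimal sketch and not part of the original answer: `window_starts` is a hypothetical helper that computes the start indices arithmetically instead of consuming an iterator, and the ID column is assumed to be sorted so that each group is one contiguous run of rows.

```python
import numpy as np

def window_starts(runs, skip):
    """Hypothetical helper: take `run` indices, then skip `skip` indices, for each run."""
    starts, pos = [], 0
    for run in runs:
        starts.extend(range(pos, pos + run))
        pos += run + skip
    return starts

# The abstract example from step 3: runs [2, 1, 1] with a skip of 1.
print(window_starts([2, 1, 1], 1))  # [0, 1, 3, 5]

# The same idea on the sample data (IDs taken from column 0 of df).
ids = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])
lookback, offset, batch_size = 2, 0, 2
rows = lookback + offset + 1                    # rows spanned by one chunk (X plus y)
_, counts = np.unique(ids, return_counts=True)  # rows per group -> [4 3 3]
windows = counts - rows + 1                     # chunks per group -> [2 1 1]
starts = window_starts(windows, lookback + offset)  # -> [0, 1, 4, 7]
n_batches = int(windows.sum()) // batch_size        # -> 2
```

For the sample data, the chunks therefore start at rows 0, 1, 4 and 7, which gives two batches of two chunks each.
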
import numpy as np

df=np.array([[1,1,0.1,11],
             [1,2,0.2,12], 
             [1,3,0.3,13], 
             [1,4,0.4,14], 
             [2,5,0.5,15], 
             [2,6,0.6,16], 
             [2,7,0.7,17],
             [3,8,0.8,18],
             [3,9,0.9,19],
             [3,10,0.7,20]])

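def take(xs, runs, skip_size):
    # See https://stackoverflow.com/questions/65163947/iterate-over-a-list-based-on-list-with-set-of-iteration-steps
    # For each run: yield run_size items from xs, then discard skip_size items.
    ixs = iter(xs)
    for run_size in runs:
        for _ in range(run_size):
            yield next(ixs)
        for _ in range(skip_size):
            next(ixs, None)  # the default keeps an exhausted iterator from raising here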
            
def get_batch(df, target, lookback, offset, batch_size):
    # Rows per group; assumes the group ID is in column 0 and df is sorted by it
    _ , c = np.unique(df[:,0], return_counts=True)
    rows = (lookback+offset+1)   # rows spanned by one chunk (X plus y)
    w = c-rows+1                 # chunks available per group
    itr = take(range(len(df)), w, lookback+offset)
    while True:
        X, Y = [],[]
        for _ in range(batch_size):
            k = next(itr, None)
            if k is None:
                return           # out of chunks: stop and drop any incomplete batch
            x = df[k:lookback+k, 1:]
            y = df[rows+k-1:rows+k, target]
            X.append(x)
            Y.append(y)
        yield np.stack(X), np.stack(Y)
            
            
lookback = 2
offset = 0
batch_size = 2
target = slice(2,4) #set the target as a slice instead of a separate df view

windows = get_batch(df, target, lookback, offset, batch_size)

no_batches = int(np.sum(np.unique(df[:,0], return_counts=True)[1] - lookback - offset)/batch_size)

for i in range(no_batches):
    input,output=next(windows)
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))
Output for lookback = 2, offset = 0, batch_size = 2:

1th batch: 
input is:
[[[ 1.   0.1 11. ]
  [ 2.   0.2 12. ]]

 [[ 2.   0.2 12. ]
  [ 3.   0.3 13. ]]]
 and 
target is:
[[[ 0.3 13. ]]

 [[ 0.4 14. ]]]

2th batch: 
input is:
[[[ 5.   0.5 15. ]
  [ 6.   0.6 16. ]]

 [[ 8.   0.8 18. ]
  [ 9.   0.9 19. ]]]
 and 
target is:
[[[ 0.7 17. ]]

 [[ 0.7 20. ]]]

Another example:

lookback = 1
offset = 1
batch_size = 1
target = slice(2,4) #set the target as a slice instead of a separate df view

windows = get_batch(df, target, lookback, offset, batch_size)

no_batches = int(np.sum(np.unique(df[:,0], return_counts=True)[1] - lookback - offset)/batch_size)

for i in range(no_batches):
    input,output=next(windows)
    print("{}th batch: \ninput is:\n{}\n and \ntarget is:\n{}\n".format(i+1, input, output))
    
Output for lookback = 1, offset = 1, batch_size = 1:

1th batch: 
input is:
[[[ 1.   0.1 11. ]]]
 and 
target is:
[[[ 0.3 13. ]]]

2th batch: 
input is:
[[[ 2.   0.2 12. ]]]
 and 
target is:
[[[ 0.4 14. ]]]

3th batch: 
input is:
[[[ 5.   0.5 15. ]]]
 and 
target is:
[[[ 0.7 17. ]]]

4th batch: 
input is:
[[[ 8.   0.8 18. ]]]
 and 
target is:
[[[ 0.7 20. ]]]
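
Note that the targets printed above have shape (batch_size, 1, n_targets), whereas the expected output in the question has shape (batch_size, n_targets). A small sketch of dropping the extra axis follows; the array `Y` is copied from the first batch printed for lookback = 2, and the variable names are only illustrative.

```python
import numpy as np

# Targets as yielded by get_batch for the first batch (lookback = 2, offset = 0):
# shape (batch_size, 1, n_targets).
Y = np.array([[[0.3, 13.0]],
              [[0.4, 14.0]]])

# Drop the length-1 middle axis to get the 2-D targets shown in the question.
Y_2d = Y[:, 0, :]
print(Y_2d.shape)  # (2, 2)
```

An alternative with the same effect is to index the target row with a scalar inside get_batch, that is `df[rows+k-1, target]` instead of the one-row slice `df[rows+k-1:rows+k, target]`.
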

Vectorized NumPy approach

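However, if you can run a vectorized NumPy computation over all the data at once instead of using a generator, I have written the following as well. If df is large, you can simply pass chunks of df to this function and get a set of batches for each chunk.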

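  1. Break the array into unequal-length groups based on id_column
  2. Get rolling windows over axis=0 using stride tricks
  3. Stack all the windows into one block
  4. Calculate the number of batches possible
  5. Keep only as many windows as can be stacked into equal-sized batches
  6. Split the block by number of batches and get X
  7. Split the block by number of batches and get y
  8. Return all X and y batches at once (as lists of arrays, since np.split returns a list)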
import numpy as np

df=np.array([[1,1,0.1,11],
             [1,2,0.2,12], 
             [1,3,0.3,13], 
             [1,4,0.4,14], 
             [2,5,0.5,15], 
             [2,6,0.6,16], 
             [2,7,0.7,17],
             [3,8,0.8,18],
             [3,9,0.9,19],
             [3,10,0.7,20]])

lookback=1
batch_size=2
offset = 1

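def window_split_2d(g, window):
    # Rolling windows of `window` consecutive rows over axis 0 of a 2-D array.
    # Result shape: (n_windows, window, n_columns); requires g.shape[0] >= window.
    shp = (g.shape[0] - window + 1, window, g.shape[-1])
    strd = (g.strides[0], g.strides[0], g.strides[1])
    # The windows overlap in memory, so return a read-only view
    return np.lib.stride_tricks.as_strided(g, shape=shp, strides=strd, writeable=False)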

def get_batches_vectorized(df, target, id_col_index, lookback, offset, batch_size):

    #Break array into unequal length groups based on id_column
    groups = np.split(df, np.where(np.diff(df[:,id_col_index]))[0]+1)
    
    #Get rolling windows over axis=0 using stride tricks
    chunks = [window_split_2d(i,lookback+offset+1) for i in groups]
    
    #Stack all the windows into a block
    block = np.vstack(chunks)
    
    #Calculate number of batches possible
    n_batches = block.shape[0]//batch_size
    
    #Keep only the number of blocks that can successfully be stacked into equal sized batches
    keep = block.shape[0]-(block.shape[0]%batch_size)
    block = block[:keep]
    
    #Split block by num batches and get X (dropping the ID column)
    X = np.split(np.delete(block[:,:lookback], id_col_index, axis=2), n_batches)

    #Split block by num batches and get y
    y = np.split(block[:,-1,target], n_batches)
    return X, y
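
As a cross-check of the steps above, here is a sketch that swaps `as_strided` for `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) and calls the result on the sample data. The function name `batches_with_sliding_window` is hypothetical and not part of the original answer. The sketch assumes that rows of the same group are contiguous, that `target` is a slice, and that at least one group has `lookback + offset + 1` rows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def batches_with_sliding_window(df, target, id_col_index, lookback, offset, batch_size):
    window = lookback + offset + 1
    # Split into groups wherever the ID changes
    groups = np.split(df, np.where(np.diff(df[:, id_col_index]))[0] + 1)
    # sliding_window_view puts the window axis last, so move it to the middle;
    # groups shorter than one window are skipped
    chunks = [sliding_window_view(g, window, axis=0).transpose(0, 2, 1)
              for g in groups if len(g) >= window]
    block = np.concatenate(chunks)          # (n_windows, window, n_columns)
    n_batches = len(block) // batch_size
    block = block[:n_batches * batch_size]  # drop windows that do not fill a batch
    X = np.delete(block[:, :lookback], id_col_index, axis=2)
    y = block[:, -1, target]
    X = X.reshape(n_batches, batch_size, lookback, X.shape[-1])
    y = y.reshape(n_batches, batch_size, y.shape[-1])
    return X, y

df = np.array([[1, 1, 0.1, 11], [1, 2, 0.2, 12], [1, 3, 0.3, 13], [1, 4, 0.4, 14],
               [2, 5, 0.5, 15], [2, 6, 0.6, 16], [2, 7, 0.7, 17],
               [3, 8, 0.8, 18], [3, 9, 0.9, 19], [3, 10, 0.7, 20]])

X, y = batches_with_sliding_window(df, slice(2, 4), id_col_index=0,
                                   lookback=1, offset=1, batch_size=2)
print(X.shape, y.shape)  # (2, 2, 1, 3) (2, 2, 2)
```

Unlike the list returned by `np.split`, this variant returns single arrays whose first axis indexes the batch, and its targets are already 2-D per batch.
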

Source: for python - Creating sequences by group ID with numpy arrays, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65043850/
