python-3.x - How to implement a Pytorch 1D cross-correlation for long signals in the Fourier domain?

Tags: python-3.x gpu fft pytorch cross-correlation

I have a series of signals, each of length n = 36,000, on which I need to perform cross-correlations. Currently, my CPU implementation in numpy is somewhat slow. I have heard that Pytorch can greatly speed up tensor operations and offers a way to run computations in parallel on a GPU. I would like to explore this option, but I am not quite sure how to accomplish this with the framework.

Because of the length of these signals, I would prefer to perform the cross-correlation in the frequency domain.

Normally, with numpy I would perform the operation like this:

import numpy as np
from scipy.fftpack import next_fast_len

signal_length=36000

# make the signals
signal_1 = np.random.uniform(-1,1, signal_length)
signal_2 = np.random.uniform(-1,1, signal_length)

# output target length of crosscorrelation
x_cor_sig_length = signal_length*2 - 1

# get optimized array length for fft computation
# (next_fast_len lives in scipy, not numpy)
fast_length = next_fast_len(x_cor_sig_length)

# move data into the frequency domain. axis=-1 to perform 
# along last dimension
fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
fft_2 = np.fft.rfft(signal_2, fast_length, axis=-1)

# take the complex conjugate of one of the spectrums. Which one you choose depends on domain specific conventions
fft_1 = np.conj(fft_1)


fft_multiplied = fft_1 * fft_2

# back to time domain. 
prelim_correlation = np.fft.irfft(fft_multiplied, x_cor_sig_length, axis=-1)

# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real

final_result = np.real(np.fft.fftshift(prelim_correlation, axes=-1)).astype(np.float64)
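For reference, a minimal sanity check of this recipe against np.correlate on a short signal (a sketch; with n = 2*len - 1 the circular correlation contains the full linear one, and the shifted FFT result matches np.correlate exactly):

import numpy as np

a = np.random.uniform(-1, 1, 100)
b = np.random.uniform(-1, 1, 100)
n = 2 * len(a) - 1                        # full correlation length
direct = np.correlate(b, a, mode='full')  # lags -(len-1) .. (len-1)
via_fft = np.fft.irfft(np.conj(np.fft.rfft(a, n)) * np.fft.rfft(b, n), n)
via_fft = np.fft.fftshift(via_fft)        # move zero lag to the centre
assert np.allclose(direct, via_fft)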

Looking at the Pytorch documentation, there does not seem to be an equivalent of numpy.conj(). I am also not sure if/how I need to implement an fftshift after the irfft operation.

So how would you write a 1D cross-correlation in Pytorch using the Fourier method?

Best Answer

A few things to consider.

The Python interpreter is very slow; what these vectorized libraries do is move the workload to native implementations. To make any difference, you need to be able to perform many operations in a single Python instruction. Evaluating things on the GPU follows the same principle: although the GPU has more compute power, copying data to and from the GPU is slow.
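To get a feel for that copy cost, here is a minimal sketch (it assumes a CUDA device is available) that times a host-to-device transfer of a batch of signals:

import time
import torch

x = torch.rand(100, 36000)
if torch.cuda.is_available():
    torch.cuda.synchronize()              # make sure the device is idle
    t0 = time.time()
    y = x.to('cuda')
    torch.cuda.synchronize()              # wait until the copy has finished
    print(f'copy to GPU: {1000 * (time.time() - t0):.1f} ms')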

I adapted your example to process multiple signals at the same time.

import numpy as np
def numpy_xcorr(BATCH=1, signal_length=36000, factors=[2, 3, 5, 7], dtype=np.float64):
    # make the signals
    signal_1 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)
    signal_2 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length*2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # move data into the frequency domain. axis=-1 to perform 
    # along last dimension
    fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
    fft_2 = np.fft.rfft(signal_2 + 0.1 * signal_1, fast_length, axis=-1)

    # take the complex conjugate of one of the spectrums. 
    fft_1 = np.conj(fft_1)


    fft_multiplied = fft_1 * fft_2

    # back to time domain. 
    prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real

    final_result = np.fft.fftshift(np.real(prelim_correlation), axes=-1)
    return final_result, np.sum(final_result)
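For example, correlating 10 pairs of signals at once (the output has one row per pair and fast_length columns):

xcorr, checksum = numpy_xcorr(BATCH=10)
print(xcorr.shape)  # (10, fast_length)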

Since torch 1.7 we have the torch.fft module, which provides an interface similar to numpy.fft; fftshift is missing, but the same result can be obtained with torch.roll. Another point is that numpy uses 64-bit precision by default, while torch will use 32-bit precision.
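A quick sketch of that equivalence (note that torch 1.8+ also ships torch.fft.fftshift):

import numpy as np
import torch

x = torch.arange(8, dtype=torch.float)
# rolling by n//2 along the transformed axis reproduces np.fft.fftshift
assert np.allclose(np.fft.fftshift(x.numpy()),
                   torch.roll(x, x.numel() // 2, dims=-1).numpy())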

Getting the fast length comes down to choosing smooth numbers (those whose factorization consists of small prime factors; I assume you are familiar with this topic).

def next_fast_len(n, factors=[2, 3, 5, 7]):
    '''
      Returns the minimum integer not smaller than n that can
      be written as a product (possibly with repetitions) of
      the given factors.
    '''
    best = float('inf')
    stack = [1]
    while len(stack):
        a = stack.pop()
        if a >= n:
            if a < best:
                best = a
                if best == n:
                    break  # no reason to keep searching
        else:
            for p in factors:
                b = a * p
                if b < best:
                    stack.append(b)
    return best
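For the 71,999-point target above this gives, for example:

print(next_fast_len(71999))          # 72000 = 2^6 * 3^2 * 5^3
print(next_fast_len(71999, [2, 3]))  # 73728 = 2^13 * 3^2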

And then the torch implementation:

import torch
import torch.fft
def torch_xcorr(BATCH=1, signal_length=36000, device='cpu', factors=[2,3,5], dtype=torch.float):
    # torch.rand is random in the range (0, 1)
    signal_1 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)
    signal_2 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)

    # just make the cross correlation more interesting
    signal_2 += 0.1 * signal_1

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length*2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # transform along the last axis (rfft keeps only the non-negative frequencies)
    fft_1 = torch.fft.rfft(signal_1, fast_length, dim=-1)
    fft_2 = torch.fft.rfft(signal_2, fast_length, dim=-1)

    # take the complex conjugate of one of the spectrums. Which one you choose depends on domain specific conventions

    fft_multiplied = torch.conj(fft_1) * fft_2

    # back to time domain. 
    # pass fast_length explicitly so odd lengths are restored correctly
    prelim_correlation = torch.fft.irfft(fft_multiplied, fast_length, dim=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real

    # roll along the last axis only, so the rows of the batch stay independent
    final_result = torch.roll(prelim_correlation, fast_length//2, dims=-1)
    return final_result, torch.sum(final_result)
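It can be called like this, picking the GPU when present and copying the result back to the host only at the very end:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
xcorr, checksum = torch_xcorr(BATCH=10, device=device)
xcorr = xcorr.cpu()  # a single device-to-host copy at the end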

Here is the code to benchmark the results:

import time
funcs = {'numpy-f64': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float64), 
         'numpy-f32': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float32), 
         'torch-cpu-f64': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float64), 
         'torch-cpu': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float32), 
         'torch-gpu-f64': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float64),
         'torch-gpu': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float32),
         }
times ={}
for batch in [1, 10, 100]:
    times[batch] = {}
    for l, f in funcs.items():
        t0 = time.time()
        t1, t2 = f(batch)
        tf = time.time()
        del t1
        del t2
        times[batch][l] = 1000 * (tf - t0) / batch
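Note that CUDA kernels launch asynchronously, so time.time() may stop the clock before the GPU has actually finished; a stricter harness (a sketch) would synchronize around the call:

def timed(f, batch):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain pending work before starting the clock
    t0 = time.time()
    out = f(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the result before stopping it
    return out, time.time() - t0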

I got the following results:

[Figure: run times for 32-bit and 64-bit precision, varying batch size, using numpy, pytorch-cpu and pytorch-gpu, with 5-smooth FFT length]

What surprised me were the results when the numbers are not so smooth: e.g. with a 17-smooth length the torch implementation is so much better that I used a logarithmic scale here (with batch size 100, torch gpu was 10000 times faster than numpy with batch size 1).

[Figure: run times for the same configurations with a less smooth FFT length, log scale]

Remember that these functions generate the data on the device; usually we will want to copy the final result to the CPU. Accounting for the time spent copying the final result to the CPU, I observed times up to 10x higher than the cross-correlation computation itself (random data generation + three FFTs).

Regarding "python-3.x - How to implement a Pytorch 1D cross-correlation for long signals in the Fourier domain?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56711772/
