c - Overlapping communication with computation in MPI (mvapich2) for large messages

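I have a very simple code, a data decomposition problem in which, in a loop, each process sends two large messages per cycle to the ranks before and after it. I run this code on a cluster of SMP nodes (AMD Magny-Cours, 32 cores per node, 8 cores per socket). I have spent a while optimizing this code. I profiled it with pgprof and tau, and it looks to me as though the bottleneck is the communication. I tried to overlap the communication with the computation in my code, but it looks like the actual communication only starts once the computation has finished :(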

I use persistent communication in ready mode (MPI_Rsend_init), and the bulk of the computation is done between MPI_Startall and MPI_Waitall. The code looks like this:

int main(int argc, char *argv[])
{
  some definitions;
  some initializations;

  MPI_Init(&argc, &argv);

  /* persistent ready-mode sends and matching receives for both neighbours */
  MPI_Rsend_init( channel to the rank before );
  MPI_Rsend_init( channel to the rank after );
  MPI_Recv_init( channel to the rank before );
  MPI_Recv_init( channel to the rank after );

  for (timestep=0; timestep<Time; timestep++)
  {
    prepare data for send;
    MPI_Startall();   /* start all four persistent requests */

    do computations;  /* intended to overlap with the transfers */

    MPI_Waitall();    /* complete all four requests */

    do work on the received data;
  }
  MPI_Finalize();
  return 0;
}

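Unfortunately, the actual data transfer does not start until the computation has finished, and I do not understand why. The network uses a QDR InfiniBand interconnect and mvapich2. Each message is 23 MB (46 MB sent in total). I tried switching the message passing to eager mode, since the system has enough memory. I use the following flags in my job script:

MV2_SMP_EAGERSIZE=46M
MV2_CPU_BINDING_LEVEL=socket
MV2_CPU_BINDING_POLICY=bunch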

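This gives me a performance improvement of about 8%, probably because of the better placement of ranks inside the SMP nodes, but the communication problem remains. My question is: why can I not effectively overlap the communication with the computation? Is there a flag that I should be using and am missing? I know something is wrong, but nothing I have done so far has been enough.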

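Because of the order of the ranks inside the SMP nodes, the actual message size between nodes is also 46 MB (2 x 23 MB), and the ranks form a ring. Can you please help me? To see which flags other users use, I checked /etc/mvapich2.conf, but it is empty.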

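Is there another method I should use? Do you think one-sided communication would give better performance? I feel there is a flag or something else that I am not aware of.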

Thanks a lot.

Best answer

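There is something called progression of operations in MPI. The standard allows non-blocking operations to be progressed to completion only once the proper test/wait call has been made: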

A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.

(Words that are bold in the text of the standard are bold here too; the emphasis is mine.)

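Although this text comes from the section on non-blocking communication (§3.7 of MPI-3.0; the text is exactly the same in MPI-2.2), it also applies to persistent communication requests.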

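I have not used MVAPICH2, but I can describe how things are implemented in Open MPI. Whenever a non-blocking operation is initiated or a persistent communication request is started, the operation is added to a queue of pending operations and is then progressed in one of two possible ways: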

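if Open MPI was compiled without an asynchronous progression thread, outstanding operations are progressed on each call to a send/receive or to one of the wait/test operations;
if Open MPI was compiled with an asynchronous progression thread, operations are progressed in the background even if no further communication calls are made.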

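The default behaviour is not to enable the asynchronous progression thread, since doing so increases the latency of the operations somewhat.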

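The MVAPICH site is unreachable from here at the moment, but earlier I saw asynchronous progress mentioned in the feature list. That is probably where you should start: look for a way to enable it.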

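Also note that MV2_SMP_EAGERSIZE controls the eager message size of the shared-memory protocol and does not affect the InfiniBand protocol, i.e. it can only improve communication between processes that reside on the same cluster node.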

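By the way, there is no guarantee that the receive operations will be started before the ready-send operations in the neighbouring ranks, so they might not function as expected, since the ordering in time is very important here.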

For "c - Overlapping communication with computation in MPI (mvapich2) for large messages", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/13963808/
