c - MPI with C: Passive RMA synchronization

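Since I have not found an answer to my question so far and am on the edge of going crazy over this problem, I will just ask the question that is tormenting me ;-)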

I am working on parallelizing a node-elimination algorithm that I have already programmed. The target environment is a cluster.

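In my parallel program I distinguish between the master process (rank 0 in my case) and the workers (every rank except 0).
My idea is that the master keeps track of which workers are available and then sends them work. For this and some other reasons, I am trying to build a workflow based on passive RMA with lock-put-unlock sequences. I use an integer array named schedule in which each position represents a rank and holds either 0 for a busy process or 1 for an available process (so if schedule[1]=1, rank 1 is available for work).
When a process finishes its work, it puts a 1 into the array on the master to signal its availability. The code I tried for this is as follows: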

 MPI_Win_lock(MPI_LOCK_EXCLUSIVE,0,0,win); // a exclusive window is locked on process 0
 printf("Process %d:\t exclusive lock on process 0 started\n",myrank);
 MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); // the line myrank of schedule is put into process 0
 printf("Process %d:\t put operation called\n",myrank);
 MPI_Win_unlock(0,win); // the window is unlocked

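It worked well, especially when the master was synchronized with a barrier at the end of the lock, because then the master's output was produced after the put operation.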

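As a next step, I tried to let the master check at regular intervals whether any workers are available. So I created a while loop that repeats until every process has signaled its availability (I repeat: this is a program meant to teach me the principles; I know the implementation still does not do what I want).
In its basic variant the loop just prints my schedule array and then checks, in a function fnz, whether any process other than the master is still working: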

while(j!=1){
printf("Process %d:\t following schedule evaluated:\n",myrank);
for(i=0;i<size;i++)printf("%d\t",schedule[i]);//print the schedule
printf("\n");
j=fnz(schedule);
}

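Then the concept blew up. After inverting the process, so that the master fetches the required information from the workers with get instead of the workers sending it to the master with put, I found that my main problem is acquiring the lock: the unlock command does not succeed because, in the put case, the lock is not granted at all, and in the get case, the lock is only granted once the worker process has finished its work and is waiting in a barrier. It seems to me there must be a serious error in my thinking. It cannot be the idea of passive RMA that a lock can only be obtained when the target process is in a barrier that synchronizes the whole communicator. In that case I could just as well use standard send/receive operations. What I want to achieve is that process 0 is busy delegating work all the time and can determine, through RMA from the workers, whom it can delegate to.
Could someone please help me and explain how I can get a break on process 0 that allows the other processes to acquire locks?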

Thanks in advance!

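Update:
I am not sure whether you have ever worked with locks, so I just want to stress that I am perfectly able to get an updated copy of a remote memory window. If I get the availability from the workers, the lock is only granted when the workers are waiting in a barrier. What I got working is this: process 0 performs lock-get-unlock while processes 1 and 2 simulate work, such that process 2 is busy noticeably longer than process 1. The result I expect is that process 0 prints a schedule (0,1,0), because process 0 is not asked at all whether it is working, process 1 has finished its work, and process 2 is still working. In the next step, when process 2 is ready, I expect the output (0,1,1), since both workers are ready for new work. What I get instead is that the workers only grant the lock to process 0 when they are waiting in a barrier, so the first and only output I get is the last one I expect, which shows me that the lock was granted for each individual process only once it had finished its work. So if someone could tell me when a lock can be granted by the target process, instead of trying to confuse what I know about passive RMA, I would be very grateful.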

Best answer

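First of all, the passive RMA mechanism does not somehow magically poke into the remote process's memory. Not many MPI transports have real RDMA capabilities, and even those that do (e.g. InfiniBand) require a great deal of not-so-passive involvement of the target in order to allow passive RMA operations to happen. This is explained in the MPI standard, but in a very abstract form: in terms of public and private copies of the memory exposed through an RMA window.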

Achieving working and portable passive RMA with MPI-2 involves several steps.

Step 1: Window allocation in the target process

For portability and performance reasons, the memory for the window should be allocated with MPI_ALLOC_MEM:

int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

for (int i = 0; i < size; i++)
{
   schedule[i] = 0;
}

MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
   MPI_COMM_WORLD, &win);

...

MPI_Win_free(&win);
MPI_Free_mem(schedule);

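Step 2: Memory synchronization at the target

The MPI standard forbids concurrent conflicting accesses to the same location in a window (§11.3 of the MPI-2.2 specification):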

It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.

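Therefore every access to schedule[] at the target must be protected by a lock (a shared one, since the target only reads the memory location):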

while (!ready)
{
   MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
   ready = fnz(schedule, oldschedule, size);
   MPI_Win_unlock(0, win);
}

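Another reason for locking the window at the target is to provide entry points into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even over transports that are not RDMA-capable, e.g. TCP/IP or shared memory, and this requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behavior results in non-portable programs. That is why the MPI standard allows the following relaxed semantics for the put operation (§11.7, p. 365):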

6 . An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.

If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it becomes necessary to update the public window copy, even if the window owner does not execute any related synchronization call.

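This is also illustrated by Example 11.12 in the same section of the standard (p. 367). Indeed, neither Open MPI nor Intel MPI updates the value of schedule[] if the lock/unlock calls in the master's code are commented out. The MPI standard further advises (§11.7, p. 366):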

Advice to users. A user can write correct programs by following the following rules:

...

lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.
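
The public/private copy semantics quoted in this step can be pictured with a toy model in plain C. This is only a conceptual sketch, not how any MPI library is implemented: the type `toy_window` and the functions `toy_remote_put`, `toy_owner_sync`, and `toy_local_load` are hypothetical names, and the model shows the worst case the standard permits, where an update reaches the private copy only when the window owner makes a synchronization call.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS 4

/* Toy stand-in for the window memory at the target; not an MPI structure. */
struct toy_window {
    int public_copy[TOY_SLOTS];   /* updated by remote puts */
    int private_copy[TOY_SLOTS];  /* read by plain local loads at the owner */
};

/* Models a remote lock-put-unlock: once the origin has unlocked,
 * the public copy holds the new value. */
static void toy_remote_put(struct toy_window *w, int slot, int value)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    w->public_copy[slot] = value;
}

/* Models a synchronization call made by the window owner (for example a
 * local lock): at the latest here, updates show up in the private copy. */
static void toy_owner_sync(struct toy_window *w)
{
    memcpy(w->private_copy, w->public_copy, sizeof w->private_copy);
}

/* Models a plain local load by the window owner. */
static int toy_local_load(const struct toy_window *w, int slot)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    return w->private_copy[slot];
}
```

In this model, a load issued after `toy_remote_put` but before `toy_owner_sync` still returns the old value, which mirrors a master that polls schedule[] without ever locking its own window.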

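Step 3: Providing the correct parameters to MPI_PUT at the origin

MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would transfer everything into the first element of the target window, because its target displacement argument is 0. Given that the window at the target was created with disp_unit == sizeof(int), the correct invocation is: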

int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);

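The local value of one is thus transferred to the location rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit were set to 1, the correct put would be: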

MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
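
The addressing rule behind both calls can be checked with a few lines of plain C. The sketch below does not call MPI, and the helper names `target_byte_offset` and `int_index_at` are hypothetical. It relies only on the rule that an RMA call addresses the location target_disp * disp_unit bytes past the window base.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MPI): byte offset from the window base
 * that an RMA call addresses, i.e. target_disp * disp_unit. */
static size_t target_byte_offset(size_t target_disp, size_t disp_unit)
{
    assert(disp_unit > 0);  /* a displacement unit must be positive */
    return target_disp * disp_unit;
}

/* Index of the int array element that a byte offset lands on,
 * or -1 if the offset does not fall on an int boundary. */
static long int_index_at(size_t byte_offset)
{
    if (byte_offset % sizeof(int) != 0)
        return -1;
    return (long)(byte_offset / sizeof(int));
}
```

For rank 3, `target_byte_offset(3, sizeof(int))` and `target_byte_offset(3 * sizeof(int), 1)` give the same offset, so both forms of the put reach schedule[3], while a target displacement of 0 always lands on schedule[0].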

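Step 4: Dealing with implementation specifics

The program detailed above works out of the box with Intel MPI. With Open MPI, special care is needed. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations: rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma, and for some reason it does not progress RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behavior unless another communication call is made (MPI_BARRIER in your case). The pt2pt module, on the other hand, progresses all operations as expected. Therefore, with Open MPI the program has to be started as follows in order to explicitly select the pt2pt component: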

$ mpiexec --mca osc pt2pt ...

A fully working C99 sample follows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
       diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
       int res = 0;

       if (starttime < 0.0) starttime = MPI_Wtime();

       printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
       for (int i = 0; i < size; i++)
       {
          printf("\t%d", schedule[i]);
          res += schedule[i];
          oldschedule[i] = schedule[i];
       }
       printf("\n");

       return(res == size-1);
    }
    return 0;
}

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
       int *oldschedule = malloc(size * sizeof(int));
       // Use MPI to allocate memory for the target window
       int *schedule;
       MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

       for (int i = 0; i < size; i++)
       {
          schedule[i] = 0;
          oldschedule[i] = -1;
       }

       // Create a window. Set the displacement unit to sizeof(int) to simplify
       // the addressing at the originator processes
       MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
          MPI_COMM_WORLD, &win);

       int ready = 0;
       while (!ready)
       {
          // Without the lock/unlock schedule stays forever filled with 0s
          MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
          ready = fnz(schedule, oldschedule, size);
          MPI_Win_unlock(0, win);
       }
       printf("All workers checked in using RMA\n");

       // Release the window
       MPI_Win_free(&win);
       // Free the allocated memory
       MPI_Free_mem(schedule);
       free(oldschedule);

       printf("Master done\n");
    }
    else
    {
       int one = 1;

       // Worker processes do not expose memory in the window
       MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       // Simulate some work based on the rank
       sleep(2*rank);

       // Register with the master
       MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
       MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
       MPI_Win_unlock(0, win);

       printf("Worker %d finished RMA\n", rank);

       // Release the window
       MPI_Win_free(&win);

       printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Sample output with 6 processes:

$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule:      0       0       0       0       0       0
[ 1.995] Schedule:      0       1       0       0       0       0
Worker 1 finished RMA
[ 3.989] Schedule:      0       1       1       0       0       0
Worker 2 finished RMA
[ 5.988] Schedule:      0       1       1       1       0       0
Worker 3 finished RMA
[ 7.995] Schedule:      0       1       1       1       1       0
Worker 4 finished RMA
[ 9.988] Schedule:      0       1       1       1       1       1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done

Regarding c - MPI with C: Passive RMA synchronization, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18737545/
