networking - Cluster hangs / shows an error while executing a simple MPI program in C

Tags: networking network-programming cluster-computing mpi openmpi

I am trying to run a simple MPI program (addition of multiple arrays). It runs perfectly on my PC, but on the cluster it simply hangs or shows the error below. I am using Open MPI and the command below to execute it.

Network configuration of the cluster (master & node1):

            MASTER
eth0      Link encap:Ethernet  HWaddr 00:22:19:A4:52:74  
          inet addr:10.1.1.1  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16914 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7183 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2050581 (1.9 MiB)  TX bytes:981632 (958.6 KiB)

eth1      Link encap:Ethernet  HWaddr 00:22:19:A4:52:76  
          inet addr:192.168.41.203  Bcast:192.168.41.255  Mask:255.255.255.0
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:701 errors:0 dropped:0 overruns:0 frame:0
          TX packets:228 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:75457 (73.6 KiB)  TX bytes:25295 (24.7 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:88362 errors:0 dropped:0 overruns:0 frame:0
          TX packets:88362 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:21529504 (20.5 MiB)  TX bytes:21529504 (20.5 MiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:A4:52:74  
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17175 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7257 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2373869 (2.2 MiB)  TX bytes:1020320 (996.4 KiB)
          Interrupt:16 Memory:da000000-da012800 

peth1     Link encap:Ethernet  HWaddr 00:22:19:A4:52:76  
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1112 errors:0 dropped:0 overruns:0 frame:0
          TX packets:302 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:168837 (164.8 KiB)  TX bytes:33241 (32.4 KiB)
          Interrupt:16 Memory:d6000000-d6012800 

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E3:80:BC  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
            
                NODE 1
eth0      Link encap:Ethernet  HWaddr 00:22:19:53:42:C6  
          inet addr:10.1.255.253  Bcast:10.1.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16559 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7299 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1898811 (1.8 MiB)  TX bytes:1056294 (1.0 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:25 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3114 (3.0 KiB)  TX bytes:3114 (3.0 KiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:53:42:C6  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16913 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7276 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2221627 (2.1 MiB)  TX bytes:1076708 (1.0 MiB)
          Interrupt:16 Memory:f8000000-f8012800 

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E7:E5:FF  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Error

mpirun -machinefile machine -np 4 ./query
error code:
[[22877,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)

Code

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define group MPI_COMM_WORLD
#define root  0
#define size  100

int main(int argc, char *argv[])
{
    int no_tasks, task_id, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(group, &no_tasks);
    MPI_Comm_rank(group, &task_id);

    int arr1[size], arr2[size], local1[size], local2[size];

    /* The root rank initialises both input arrays. */
    if (task_id == root)
    {
        for (i = 0; i < size; i++)
        {
            arr1[i] = arr2[i] = i;
        }
    }

    /* Distribute equal chunks of both arrays to all ranks. */
    MPI_Scatter(arr1, size/no_tasks, MPI_INT, local1, size/no_tasks, MPI_INT, root, group);
    MPI_Scatter(arr2, size/no_tasks, MPI_INT, local2, size/no_tasks, MPI_INT, root, group);

    /* Each rank adds its two local chunks element-wise. */
    for (i = 0; i < size/no_tasks; i++)
    {
        local1[i] += local2[i];
    }

    /* Collect the partial sums back into arr1 on the root rank. */
    MPI_Gather(local1, size/no_tasks, MPI_INT, arr1, size/no_tasks, MPI_INT, root, group);

    if (task_id == root)
    {
        printf("The Array Sum Is\n");
        for (i = 0; i < size; i++)
        {
            printf("%d  ", arr1[i]);
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
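
For completeness, a minimal build-and-run sketch for the program above (assuming the source file is saved as query.c and the machinefile machine lists the master and node1 hosts):

$ mpicc query.c -o query
$ mpirun -machinefile machine -np 4 ./query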

Best answer

Tell Open MPI not to use the virtual bridge interface virbr0 for sending messages over TCP/IP, or better yet, tell it to use only eth0 for that purpose:

$ mpiexec --mca btl_tcp_if_include eth0 ...

This comes from the greedy behaviour of Open MPI's tcp BTL component, which transports messages over TCP/IP. It tries to use every available network interface on each node in order to maximise the data bandwidth. Both nodes have a virbr0 interface configured with the same address and subnet (192.168.122.1/24). Because the subnets match, Open MPI assumes it should be able to communicate over virbr0. So process A tries to send a message to process B, which resides on the other node. Process B listens on port P, and process A knows that, so it tries to connect to 192.168.122.1:P. But that address actually belongs to the virbr0 interface on the node where process A itself runs, so the node ends up trying to talk to itself on a port where nothing is listening, hence the "Connection refused" error.
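
Equivalently, the virtual interfaces can be excluded instead of whitelisting eth0, and the setting can be made permanent in the per-user MCA parameter file. A sketch, assuming a default Open MPI installation (note that setting btl_tcp_if_exclude replaces the built-in exclusion list, so the loopback interface lo has to be listed explicitly as well):

$ mpiexec --mca btl_tcp_if_exclude lo,virbr0 -machinefile machine -np 4 ./query

# or persist the restriction in $HOME/.openmpi/mca-params.conf:
btl_tcp_if_include = eth0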

Original question on Stack Overflow: https://stackoverflow.com/questions/15227933/
