java - OpenMPI:MPI.Init 卡在 Java 中 - 如何调试?

标签 java mpi openmpi

我已经根据 https://www.open-mpi.org/faq/?category=java 在本地编译了支持 Java 的 OpenMPI 。在我使用 Oracle Java 8 的本地计算机上,此方法工作正常,但在使用 OpenJDK 8 的集群上,此方法会导致 MPI Init 挂起。您对如何从这里继续进行有任何指示吗?跟踪?想尝试其他版本的 Java?我找不到任何关于该接口(interface)在 Java 版本方面支持的文档。

package com.acme.hello;
import mpi.*;

public class HelloMpi {
    public static void main(String args[]) throws Exception {
        int me,size;
        System.out.println("attempting MPI init");
        args=MPI.Init(args);
        System.out.println("MPI init done");
    }
}

> java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

> ~/NQSIM/java$ mpirun -version
mpirun (Open MPI) 3.1.2

> ~/NQSIM/java$ mpirun -np 2 java -classpath 
"./target/test-classes/" com.acme.hello.HelloMpi
attempting MPI init
attempting MPI init
(hangs here forever)

编辑:examples/hello_c 显示相同的行为,因此它与 Java 无关。我想这一定是交通工具里的东西。我必须仅使用用户权限构建/安装 OpenMPI。系统上已有 OpenMPI,但不支持 Java。关于如何继续的任何想法?

Edit2:切换到不同的字节层,例如使用 --mca btl vader,self 有效。以下是聚会停止前 --mca btl_base_verbose 的输出:

[fdr4:33013] mca: base: components_register: registering framework btl components
[fdr4:33013] mca: base: components_register: found loaded component sm
[fdr4:33014] mca: base: components_register: registering framework btl components
[fdr4:33014] mca: base: components_register: found loaded component sm
[fdr4:33013] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: found loaded component self
[fdr4:33014] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: component self register function successful
[fdr4:33014] mca: base: components_register: found loaded component self
[fdr4:33013] mca: base: components_register: found loaded component tcp
[fdr4:33014] mca: base: components_register: component self register function successful
[fdr4:33013] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component tcp
[fdr4:33013] mca: base: components_register: found loaded component vader
[fdr4:33013] mca: base: components_register: component vader register function successful
[fdr4:33013] mca: base: components_register: found loaded component openib
[fdr4:33014] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component vader
[fdr4:33014] mca: base: components_register: component vader register function successful
[fdr4:33014] mca: base: components_register: found loaded component openib
[fdr4:33013] mca: base: components_register: component openib register function successful
[fdr4:33013] mca: base: components_open: opening btl components
[fdr4:33013] mca: base: components_open: found loaded component sm
[fdr4:33013] mca: base: components_open: component sm open function successful
[fdr4:33013] mca: base: components_open: found loaded component self
[fdr4:33013] mca: base: components_open: component self open function successful
[fdr4:33013] mca: base: components_open: found loaded component tcp
[fdr4:33013] mca: base: components_open: component tcp open function successful
[fdr4:33013] mca: base: components_open: found loaded component vader
[fdr4:33013] mca: base: components_open: component vader open function successful
[fdr4:33013] mca: base: components_open: found loaded component openib
[fdr4:33013] mca: base: components_open: component openib open function successful
[fdr4:33013] select: initializing btl component sm
[fdr4:33014] mca: base: components_register: component openib register function successful
[fdr4:33014] mca: base: components_open: opening btl components
[fdr4:33014] mca: base: components_open: found loaded component sm
[fdr4:33014] mca: base: components_open: component sm open function successful
[fdr4:33014] mca: base: components_open: found loaded component self
[fdr4:33014] mca: base: components_open: component self open function successful
[fdr4:33014] mca: base: components_open: found loaded component tcp
[fdr4:33014] mca: base: components_open: component tcp open function successful
[fdr4:33014] mca: base: components_open: found loaded component vader
[fdr4:33014] mca: base: components_open: component vader open function successful
[fdr4:33014] mca: base: components_open: found loaded component openib
[fdr4:33014] mca: base: components_open: component openib open function successful
[fdr4:33014] select: initializing btl component sm
[fdr4:33014] select: init of component sm returned success
[fdr4:33014] select: initializing btl component self
[fdr4:33014] select: init of component self returned success
[fdr4:33014] select: initializing btl component tcp
[fdr4:33013] select: init of component sm returned success
[fdr4:33013] select: initializing btl component self
[fdr4:33013] select: init of component self returned success
[fdr4:33013] select: initializing btl component tcp
[fdr4:33014] select: init of component tcp returned success
[fdr4:33014] select: initializing btl component vader
[fdr4:33013] select: init of component tcp returned success
[fdr4:33013] select: initializing btl component vader
[fdr4:33014] select: init of component vader returned success
[fdr4:33014] select: initializing btl component openib
[fdr4:33013] select: init of component vader returned success
[fdr4:33013] select: initializing btl component openib
[fdr4:33014] Checking distance from this process to device=mlx4_0
[fdr4:33013] Checking distance from this process to device=mlx4_0
[fdr4:33013] hwloc_distances->nbobjs=4
[fdr4:33013] hwloc_distances->latency[0]=1.000000
[fdr4:33013] hwloc_distances->latency[1]=2.000000
[fdr4:33013] hwloc_distances->latency[2]=3.000000
[fdr4:33014] hwloc_distances->nbobjs=4
[fdr4:33014] hwloc_distances->latency[0]=1.000000
[fdr4:33014] hwloc_distances->latency[1]=2.000000
[fdr4:33014] hwloc_distances->latency[2]=3.000000
[fdr4:33013] hwloc_distances->latency[3]=2.000000
[fdr4:33013] hwloc_distances->latency[4]=2.000000
[fdr4:33013] hwloc_distances->latency[5]=1.000000
[fdr4:33013] hwloc_distances->latency[6]=2.000000
[fdr4:33013] hwloc_distances->latency[7]=3.000000
[fdr4:33013] ibv_obj->logical_index=1
[fdr4:33014] hwloc_distances->latency[3]=2.000000
[fdr4:33014] hwloc_distances->latency[4]=2.000000
[fdr4:33014] hwloc_distances->latency[5]=1.000000
[fdr4:33014] hwloc_distances->latency[6]=2.000000
[fdr4:33014] hwloc_distances->latency[7]=3.000000
[fdr4:33014] ibv_obj->logical_index=1
[fdr4:33013] my_obj->logical_index=0
[fdr4:33013] Process is bound: distance to device is 2.000000
[fdr4:33014] my_obj->logical_index=0
[fdr4:33014] Process is bound: distance to device is 2.000000
[fdr4:33013] [rank=0] openib: using port mlx4_0:1
[fdr4:33013] select: init of component openib returned success
[fdr4:33014] [rank=1] openib: using port mlx4_0:1
[fdr4:33014] select: init of component openib returned success
[fdr4:33013] mca: bml: Using self btl for send to [[59315,1],0] on node fdr4
[fdr4:33014] mca: bml: Using self btl for send to [[59315,1],1] on node fdr4
[fdr4:33013] mca: bml: Using vader btl for send to [[59315,1],1] on node fdr4
[fdr4:33014] mca: bml: Using vader btl for send to [[59315,1],0] on node fdr4

最佳答案

已经解决了。在这种情况下,问题是对用户施加的限制之一。服务器被配置为使用默认设置,但在更改 /etc/security/limits.conf 中的以下内容后,它开始使用默认字节层(因为我无法直接自己测试它,不幸的是,我不知道这两个设置中哪一个是肇事者):

*               -       memlock         unlimited
*               -       nofile          16384

关于java - OpenMPI:MPI.Init 卡在 Java 中 - 如何调试?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/52815425/

相关文章:

language-agnostic - 并行代码文档的哪种图表?

C++ OpenMPI 链表

java - 请为这段 Java 泛型代码建议帮助方法

java - 在其他类中使用数组输出

c - 如何使具有两个线程的两个进程在MPI中相互接收、发送?

c - 在 Open MPI 中处理图像 block

c++ - 如何禁用 MPI 的 C++ 包装器?

ssh - 配置 MPI hostsfile 以使用多个用户身份

java - TouchScreen Nokia C5 -J2ME 的命令操作按钮和滚动在更改系统时间后不起作用

java - 无法在java程序中创建初始上下文