openmpi - mpi4py irecv causes a segmentation fault

Tags: openmpi mpich mpi4py

I am running the following code, which sends an array from rank 0 to rank 1, with the command mpirun -n 2 python -u test_irecv.py > output 2>&1.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
asyncr = 1          # 1: use isend/irecv, 0: use send/recv
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    if asyncr: comm.isend(arrs, dest=1).wait()     # non-blocking, pickle-based send
    else: comm.send(arrs, dest=1)                  # blocking send
else:
    if asyncr: arrv = comm.irecv(source=0).wait()  # non-blocking receive, no buffer supplied
    else: arrv = comm.recv(source=0)               # blocking receive

print('Done!', comm.Get_rank())

Running in synchronous mode with asyncr = 0 gives the expected output:

Done! 0
Done! 1

But running in asynchronous mode with asyncr = 1 produces the error below. I need to understand why it runs fine in synchronous mode but fails in asynchronous mode.

Output with asyncr = 1:

Done! 0
[nia1477:420871:0:420871] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000643d1 ompi_errhandler_request_invoke()  ???:0
 2 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 3 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 4 0x000000000008a8b5 __pyx_pf_6mpi4py_3MPI_7Request_34wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83838
 5 0x000000000008a8b5 __pyx_pw_6mpi4py_3MPI_7Request_35wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83813
 6 0x00000000000966a3 _PyMethodDef_RawFastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/call.c:690
 7 0x000000000009eeb9 _PyMethodDescr_FastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/descrobject.c:288
 8 0x000000000006e611 call_function()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:4563
 9 0x000000000006e611 _PyEval_EvalFrameDefault()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3103
10 0x0000000000177644 _PyEval_EvalCodeWithName()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3923
11 0x000000000017774e PyEval_EvalCodeEx()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3952
12 0x000000000017777b PyEval_EvalCode()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:524
13 0x00000000001aab72 run_mod()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:1035
14 0x00000000001aab72 PyRun_FileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:988
15 0x00000000001aace6 PyRun_SimpleFileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:430
16 0x00000000001cad47 pymain_run_file()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:425
17 0x00000000001cad47 pymain_run_filename()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:1520
18 0x00000000001cad47 pymain_run_python()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2520
19 0x00000000001cad47 pymain_main()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2662
20 0x00000000001cb1ca _Py_UnixMain()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2697
21 0x00000000000202e0 __libc_start_main()  ???:0
22 0x00000000004006ba _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 420871 on node nia1477 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The versions are as follows:

  • Python: 3.7.0
  • mpi4py: 3.0.0
  • mpiexec --version gives mpiexec (OpenRTE) 3.1.2
  • mpicc -v gives icc version 18.0.3 (gcc version 7.3.0 compatibility)

Running with asyncr = 1 on another system with MPICH gives the following output.

Done! 0
Traceback (most recent call last):
  File "test_irecv.py", line 14, in <module>
    if asyncr: arrv = comm.irecv(source=0).wait()
  File "mpi4py/MPI/Request.pyx", line 235, in mpi4py.MPI.Request.wait
  File "mpi4py/MPI/msgpickle.pxi", line 411, in mpi4py.MPI.PyMPI_wait
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23830,1],1]
  Exit code:    1
--------------------------------------------------------------------------
[master:01977] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[master:01977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Best answer

Apparently this is a known issue in mpi4py, as described in https://bitbucket.org/mpi4py/mpi4py/issues/65/mpi_err_truncate-message-truncated-when. Lisandro Dalcin says:

The implementation of irecv() for large messages requires users to pass a buffer-like object large enough to receive the pickled stream. This is not documented (as most of mpi4py), and even non-obvious and unpythonic...
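
To see roughly how large the message is, the sketch below uses the standard-library pickle as a stand-in for the stream mpi4py sends (an approximation for illustration, not mpi4py's internal code); a 10000-element float64 array pickles to roughly 80 kB, more than a small default receive buffer can hold.

import pickle
import numpy as np

arrs = np.zeros(10000)                     # the array sent in the question
payload = pickle.dumps(arrs, protocol=-1)  # rough stand-in for mpi4py's pickled stream
print(len(payload))                        # about 80000 bytes of float64 data plus pickle overhead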

The fix is to pass a sufficiently large preallocated bytearray to irecv. A working example follows.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    comm.isend(arrs, dest=1).wait()
else:
    # preallocate a 1 MiB receive buffer, large enough for the pickled array
    arrv = comm.irecv(bytearray(1<<20), source=0).wait()

print('Done!', comm.Get_rank())
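
An alternative that avoids the pickle buffer issue altogether, assuming the receiver knows the array's size and dtype in advance, is mpi4py's uppercase buffer-based API (Isend/Irecv), which transfers the NumPy array's memory directly. A minimal sketch:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank() == 0:
    arrs = np.zeros(size_arr)
    comm.Isend(arrs, dest=1).Wait()   # sends the raw buffer, no pickling
else:
    arrv = np.empty(size_arr)         # receiver preallocates a matching float64 buffer
    comm.Irecv(arrv, source=0).Wait()

print('Done!', comm.Get_rank())

Both sides must agree on the buffer's size and datatype, but in exchange there is no pickling overhead and no hidden temporary buffer to size.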

Regarding openmpi - mpi4py irecv causes a segmentation fault, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59559597/
