SLURM sbatch Fails With OpenMPI Error

User_9PL5E · Nov 2 2021

I'm trying to run a SLURM sbatch job on Oracle Cloud, but it is failing with an OpenMPI error:

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            inst-a7j5w-team3
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4125

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           inst-a7j5w-team3
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      inst-a7j5w-team3
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[inst-a7j5w-team3:66225] *** An error occurred in MPI_Init
[inst-a7j5w-team3:66225] *** reported by process [1752891393,139706696204288]
[inst-a7j5w-team3:66225] *** on a NULL communicator
[inst-a7j5w-team3:66225] *** Unknown error
[inst-a7j5w-team3:66225] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[inst-a7j5w-team3:66225] ***    and potentially your MPI job)
[inst-a7j5w-team3:66213] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[inst-a7j5w-team3:66213] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[inst-a7j5w-team3:66213] 11 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[inst-a7j5w-team3:66213] 3 more processes have sent help message help-mca-base.txt / find-available:not-valid
[inst-a7j5w-team3:66213] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[inst-a7j5w-team3:66213] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
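Since the log says the ucx PML component was not found, one way to confirm what this Open MPI build actually includes is a quick diagnostic sketch (assuming Open MPI's standard `ompi_info` tool is on the PATH of the compute nodes):

```shell
# List the PML components Open MPI can load; "ucx" should appear here
# if UCX support was built in and its shared libraries are resolvable.
ompi_info | grep -i "MCA pml"

# Also list the BTL components, since the log shows the openib BTL
# being disabled on this port.
ompi_info | grep -i "MCA btl"
```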

The following is my sbatch file:

#!/bin/bash
# Run this file by typing out
# sbatch ./slurm_command.sh
# in the command line.

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --partition=compute
cd $SLURM_SUBMIT_DIR

# MCA must be set as an environment variable. In the case of the Oracle Cloud Cluster
# this would be the following
# export MCA=' --mca btl_openib_warn_no_device_params_found 0 --mca pml ob1 --mca btl ^openib'
mpirun -mca btl self \
       -x UCX_TLS=rc,self,sm \
       -x HCOLL_ENABLE_MCAST_ALL=0 \
       -mca coll_hcoll_enable 0 \
       -x UCX_IB_TRAFFIC_CLASS=105 \
       -x UCX_IB_GID_INDEX=3 \
       --cpu-set 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35 \
       -np 4 ./{executable}

I am running the sbatch command from my head node, and my compute nodes are BM.Optimized3.36. The Open MPI version is 4.1.1.
I recognize that the warnings should be fixed, but I'm not sure how to set the preset device parameters, and searching on Google hasn't turned up anything useful. Any help here would be appreciated.
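For reference, the log itself names the MCA parameter for silencing the device-parameter warning, and the commented-out line in the script above combines it with a fallback to the ob1 PML. A sketch of that variant (untested here, and note it works around the missing ucx component rather than fixing it; `./{executable}` is the placeholder from the original script):

```shell
# Silence the "no preset parameters" warning and avoid the missing ucx PML
# by falling back to ob1 and excluding the openib BTL, as in the script's
# own commented-out MCA line.
export MCA='--mca btl_openib_warn_no_device_params_found 0 --mca pml ob1 --mca btl ^openib'
mpirun $MCA -np 4 ./{executable}
```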
