Parallel-computing jobs
#######################
Some workloads are *embarrassingly parallel*: you submit a Slurm
**job array**, and each array task runs an independent serial program on
its own core. When multiple CPU cores must **co-operate on the same
problem**, the application has to be written (or compiled) for one of the
parallel-programming models below.

.. contents::
   :local:
   :depth: 1

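
A job array for the embarrassingly parallel case might look like the following. This is a minimal sketch, not a site-specific recipe: the ``input_<N>.dat`` naming scheme, the helper ``input_for_task`` and the program ``my_serial.exe`` are hypothetical, and the fallback to task 0 is only there so the script also runs outside Slurm::

    #!/bin/bash
    #SBATCH --array=0-9            # ten independent array tasks
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1      # one core per task

    # Map an array index to an input file (hypothetical naming scheme).
    input_for_task() {
        echo "input_$1.dat"
    }

    # Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 elsewhere.
    task_id=${SLURM_ARRAY_TASK_ID:-0}
    input=$(input_for_task "$task_id")
    echo "array task ${task_id} processes ${input}"
    # ./my_serial.exe "$input"     # hypothetical serial program

Each array task sees a different ``SLURM_ARRAY_TASK_ID``, so the same script selects a different input without any communication between tasks.
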
Shared-memory (threads)
***********************
* One executable starts, then spawns **threads** that run
  concurrently on cores of the **same node**.
* All threads share a single address space, so global and heap data are
  visible to every thread; each thread still has its own stack, and
  OpenMP can mark variables ``private``. Synchronisation primitives
  (e.g. *critical*, *atomic*, barriers, mutexes) prevent race conditions.
* Typical APIs: **OpenMP**, **POSIX Threads (pthreads)**; on GPUs,
  CUDA/HIP threads follow a similar memory model.

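OpenMP example::

    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8      # 8 CPU cores on the node

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    # Slurm 22.05+ no longer passes --cpus-per-task from sbatch to srun
    srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores ./my_openmp.exe
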
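
Thread count and placement can also be set defensively. The sketch below assumes a bash job script; ``omp_threads`` is a hypothetical helper, while ``OMP_NUM_THREADS``, ``OMP_PLACES`` and ``OMP_PROC_BIND`` are standard OpenMP environment variables. Whether ``close`` binding helps depends on the application::

    # Use the per-task core count from Slurm if present, otherwise 1 thread.
    omp_threads() {
        echo "${SLURM_CPUS_PER_TASK:-1}"
    }

    export OMP_NUM_THREADS=$(omp_threads)
    export OMP_PLACES=cores        # one place per physical core
    export OMP_PROC_BIND=close     # keep threads on places near the parent thread
    echo "running with ${OMP_NUM_THREADS} OpenMP thread(s)"

Falling back to one thread avoids oversubscribing a node when the script is run outside a Slurm allocation.
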
Distributed-memory (tasks)
**************************
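* The program starts **MPI ranks (tasks)**; each has its *own* memory.
* Ranks exchange data by explicit **message passing** over InfiniBand or
  Ethernet. Every point-to-point ``MPI_Send`` must be matched by a
  receive such as ``MPI_Recv``; an unmatched send can hang the program.
* When ranks reside on the same node, most MPI implementations
  transparently use shared memory for these messages, which is faster
  than going through the network.
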
MPI example::

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16   # 32 ranks total
    #SBATCH --cpus-per-task=1

    module load OpenMPI
    # The PMI plugin name is site-specific; "srun --mpi=list" shows what is available
    srun --mpi=pmix_v3 ./my_mpi.exe

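
To check that the ranks landed where the ``#SBATCH`` lines intended, each rank can print its host name and the output can be tallied. This is a sketch under assumptions: ``ranks_per_host`` is a hypothetical helper, and the ``srun`` call is guarded so the snippet does nothing outside a Slurm job::

    # Count how many lines (one per rank) each host name contributes.
    ranks_per_host() {
        awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' | sort
    }

    # Inside a job: every rank prints its host name, then the names are tallied.
    if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        srun hostname | ranks_per_host
    fi

For the two-node example above, the expected result is two lines, each showing a host name followed by 16.
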
Hybrid model (MPI + OpenMP)
***************************
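Large codes often place **one MPI rank per socket** and spawn several
OpenMP threads within that rank. Threads share data cheaply inside the
socket, while MPI carries data between sockets and nodes, which combines
the strengths of both models.
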
Hybrid example::

    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=2    # 2 MPI ranks / node
    #SBATCH --cpus-per-task=8      # 8 threads per rank (16 cores/node)

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    # Slurm 22.05+ no longer passes --cpus-per-task from sbatch to srun
    srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_hybrid.exe

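.. tip::

   The job occupies ``nodes × ntasks-per-node × cpus-per-task`` cores in
   total (4 × 2 × 8 = 64 in the example above). Make sure
   ``ntasks-per-node × cpus-per-task`` does not exceed the number of
   cores on one node.
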
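
The core arithmetic can be checked before submitting. The helpers ``total_cores`` and ``fits_on_node`` are hypothetical, and the 16-core node size is an assumption taken from the comment in the hybrid example; substitute the real core count of your nodes::

    # Total cores = nodes x tasks-per-node x cpus-per-task.
    total_cores() {
        echo $(( $1 * $2 * $3 ))
    }

    # Succeeds if one node can hold all of its ranks and their threads.
    fits_on_node() {      # args: tasks_per_node cpus_per_task cores_per_node
        [ $(( $1 * $2 )) -le "$3" ]
    }

    total_cores 4 2 8                  # hybrid example: prints 64
    if fits_on_node 2 8 16; then
        echo "layout fits on a 16-core node"
    fi
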
GPU parallelism
***************
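GPU-accelerated programs (CUDA, HIP, OpenACC, …) follow the same
shared-memory idea, but the threads run on the GPU and work on the GPU's
own device memory, so input data usually has to be copied there from
the host first.
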
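
A GPU job can be requested in the same style as the examples above. This is a sketch under assumptions: ``--gpus`` is a standard Slurm option, but sites may require a different form or a specific partition; ``visible_gpus`` and ``my_gpu.exe`` are hypothetical; and the check is CUDA-specific, relying on ``CUDA_VISIBLE_DEVICES``, which Slurm typically sets for allocated NVIDIA GPUs::

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gpus=1               # request one GPU (site policy may differ)

    # Count the comma-separated entries in CUDA_VISIBLE_DEVICES (0 if unset or empty).
    visible_gpus() {
        printf '%s\n' "${CUDA_VISIBLE_DEVICES:-}" | awk -F, '{ print NF }'
    }

    echo "GPUs visible to this job: $(visible_gpus)"
    # srun ./my_gpu.exe            # hypothetical GPU program
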
Further reading
***************
* `OpenMP reference `_
* `MPI standard `_