Parallel-computing jobs

Some workloads are embarrassingly parallel: you submit a Slurm job array and each array task runs an independent serial program on its own core. When multiple CPU cores must co-operate on the *same* problem, the application has to be written (or compiled) for one of the parallel-programming models below.
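
A job array of this kind can be sketched as follows. The array range, the helper input_for_task, the input-file naming scheme and my_serial.exe are placeholders chosen for illustration; only --array and SLURM_ARRAY_TASK_ID come from Slurm itself.

```bash
#!/bin/bash
#SBATCH --array=0-9              # 10 independent array tasks (illustrative range)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1        # each array task gets one core

# Hypothetical helper: map an array index to an input file name,
# e.g. 7 -> input_007.dat. Adapt it to your own naming scheme.
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID in every array task; fall back to 0
# so the script can be dry-run outside a job.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input_file=$(input_for_task "$task_id")
echo "array task ${task_id}: ${input_file}"

# Run the (hypothetical) serial program only where it exists.
if [ -x ./my_serial.exe ]; then
    ./my_serial.exe "$input_file"
fi
```

Each array task runs the same script with a different SLURM_ARRAY_TASK_ID, so the index is the only thing that distinguishes one task's work from another's.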

Shared-memory (threads)

One executable starts; it then spawns threads that run concurrently on cores of the same node.
All threads share a single address space: shared variables are visible to every thread, while each thread can also keep private copies. Synchronisation primitives (e.g. critical sections, atomics, barriers, mutexes) prevent race conditions.
Typical APIs: OpenMP, POSIX Threads (pthreads); on GPUs, CUDA/HIP threads follow a similar memory model.

OpenMP example:

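#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1               # one process; its threads use the cores below
#SBATCH --cpus-per-task=8        # 8 CPU cores on the node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Pass --cpus-per-task explicitly: since Slurm 22.05 srun no longer inherits it from sbatch
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores ./my_openmp.exe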
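
One pitfall with the line that sets OMP_NUM_THREADS: Slurm only defines SLURM_CPUS_PER_TASK when --cpus-per-task was requested, so without it the variable expands to an empty string and the OpenMP runtime typically falls back to its own default thread count. A defensive variant, sketched here with a hypothetical helper name set_omp_threads, falls back to one thread instead:

```bash
# Hypothetical helper: use the Slurm allocation if present, else 1 thread.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "OpenMP threads: ${OMP_NUM_THREADS}"
```

Choosing 1 as the fallback is a conservative assumption; it avoids oversubscribing a node when the job was allocated a single core.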

Distributed-memory (tasks)

The program starts MPI ranks (tasks); each has its own memory.
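Ranks exchange data by explicit message passing over the interconnect (InfiniBand or Ethernet). In point-to-point communication, every send (e.g. MPI_Send) must be matched by a corresponding receive (e.g. MPI_Recv); an unmatched call can block forever.
Most MPI implementations transparently use shared memory for communication between ranks on the same node, which is faster than going through the network.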

MPI example:

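#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16     # 2 nodes × 16 = 32 ranks total
#SBATCH --cpus-per-task=1
module load OpenMPI
# The PMI plugin name is site-specific; `srun --mpi=list` shows the available ones
srun --mpi=pmix_v3 ./my_mpi.exe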
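
Before launching a long MPI run it can help to check where the tasks actually land. The sketch below relies on SLURM_PROCID and SLURM_NTASKS, which srun sets for every task it starts; the function name describe_task and the script name whoami.sh are made up for this example. Saved as whoami.sh and started with srun ./whoami.sh inside the job, it prints one line per task.

```bash
#!/bin/bash
# Hypothetical helper: report this task's index, the task count and the node.
# The defaults (0 and 1) only apply when the script runs outside srun.
describe_task() {
    local rank="${SLURM_PROCID:-0}"
    local size="${SLURM_NTASKS:-1}"
    echo "task ${rank} of ${size} on $(uname -n)"
}

describe_task
```

With the MPI example's request (2 nodes, 16 tasks per node) one would expect 32 lines, 16 per node name.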

Hybrid model (MPI + OpenMP)

Large codes often place one MPI rank per socket and spawn several OpenMP threads within that rank – combining the strengths of both models.

Hybrid example:

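#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2      # 2 MPI ranks / node
#SBATCH --cpus-per-task=8        # 8 threads per rank  (16 cores/node)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Pass --cpus-per-task explicitly: since Slurm 22.05 srun no longer inherits it from sbatch
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_hybrid.exe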

Tip

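Make sure the total number of cores you expect equals nodes × ntasks-per-node × cpus-per-task, and that ntasks-per-node × cpus-per-task does not exceed the cores available on one node. The hybrid example above requests 4 × 2 × 8 = 64 cores.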
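
The arithmetic behind this tip, as a small sketch. total_cores is a hypothetical helper, not a Slurm command; the numbers are taken from the hybrid and MPI examples above.

```bash
# Hypothetical helper: total cores = nodes × tasks per node × CPUs per task.
total_cores() {
    local nodes=$1 tasks_per_node=$2 cpus_per_task=$3
    echo $(( nodes * tasks_per_node * cpus_per_task ))
}

total_cores 4 2 8     # hybrid example: prints 64
total_cores 2 16 1    # MPI example:    prints 32
```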

GPU parallelism 

GPU-accelerated programs (CUDA, HIP, OpenACC, …) still follow the shared-memory idea, but the “threads” live on the GPU.
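
A GPU job script might look like the sketch below. It assumes the site exposes GPUs through a generic resource named gpu and that Slurm exports CUDA_VISIBLE_DEVICES for the job, which is common but depends on the cluster configuration; visible_gpu_count and my_gpu.exe are placeholder names.

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # host cores for the GPU process (illustrative)
#SBATCH --gres=gpu:1             # assumption: GPUs are requested as GRES "gpu"

# Hypothetical helper: count the comma-separated entries in
# CUDA_VISIBLE_DEVICES; prints 0 when the variable is unset or empty.
visible_gpu_count() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local IFS=,
    set -- $devices
    echo "$#"
}

echo "GPUs visible to this job: $(visible_gpu_count)"

# Launch the (hypothetical) GPU executable only where it exists.
if [ -x ./my_gpu.exe ]; then
    srun ./my_gpu.exe
fi
```

Printing the visible GPU count at the top of the job log is a cheap way to notice a missing GPU request before the program itself fails.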
