GPU Utilization: Measuring, Diagnosing, and Improving
This page explains how to check real-time GPU usage on a running job, review summary statistics after the fact, and profile your code to uncover bottlenecks. It also covers common causes of low or zero GPU utilization and offers practical fixes.
Measure GPU Utilization in Real Time
Find where your job is running
sqme
USER   ACCOUNT   JOBID   PARTITION  NAME           NODES  CPUS  MIN_MEMORY  TIME_LIMIT  TIME      NODELIST  ST  REASON
user1  research  501234  cpu        analysis.s     1      6     4000M       3-00:00:00  12:30:53  cpu001    R   None
user1  research  501235  cpu        tokenize.s     1      1     4000M       2-00:00:00  9:36:50   cpu002    R   None
user2  projectA  501236  gpu        train_model.s  1      4     16000M      2-12:00:00  9:48:45   gpu001    R   None
user2  projectA  501237  l40s       preprocess.s   1      2     8000M       1-12:00:00  9:52:55   l40s01    R   None
The NODELIST column shows the node on which a running job is executing.
If your job is still queued, no node has been assigned yet.
SSH to the compute node
ssh <compute-node>
Watch GPU activity
watch -n 1 nvidia-smi
Look at the GPU-Util percentage, temperature, and memory usage. Press Ctrl+C to exit watch, and type exit to leave the node.
Tip
nvidia-smi shows how often a GPU kernel is executing (duty cycle).
It does not tell you kernel quality, SIMD occupancy, or how many CUDA cores
are active. For deeper analysis, use profilers.
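If you would rather log the same numbers than watch them, nvidia-smi can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below is one possible way to take a single sample from Python on the compute node. The helper names `parse_gpu_csv` and `sample_gpus` are illustrative and not part of any cluster tooling, and the chosen field list is an assumption (run `nvidia-smi --help-query-gpu` to see what your driver supports).

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi (illustrative selection).
FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def _to_int(value):
    """Convert a CSV cell to int; cells such as "[N/A]" become None."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append({name: _to_int(cell) for name, cell in zip(FIELDS, record)})
    return rows


def sample_gpus():
    """Take one sample from nvidia-smi; returns [] if the tool is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `sample_gpus()` in a loop with a short sleep gives a simple utilization log; memory values are reported in MiB when `nounits` is used.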
Reviewing GPU Job Statistics
Several tools report summary statistics for running and completed jobs:
Jobstats:
# Replace 1234567 with your JobID
jobstats 1234567
Example output:
================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
   User/Account: user1/research
       Job Name: bash
          State: RUNNING
          Nodes: 1
      CPU Cores: 86
     CPU Memory: 1032GB (12GB per CPU-core)
           GPUs: 4
  QOS/Partition: gpu/h100
        Cluster: cluster
     Start Time: Sun Sep 28, 2025 at 8:08 AM
       Run Time: 1-02:54:30 (in progress)
     Time Limit: 3-00:00:00

                              Overall Utilization
================================================================================
  CPU utilization  [| 3%]
  CPU memory usage [|||| 8%]
  GPU utilization  [|||||||||||||||||||||||||||||||||||||||||||||| 92%]
  GPU memory usage [||||||||||||||||||||||||||||||||||||||||||||||100%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      node001: 4-06:45:51/139-00:39:39 (efficiency=3.1%)

  CPU memory usage per node - used/allocated
      node001: 80.1GB/1007.8GB (661.3MB/8.1GB per core of 124)

  GPU utilization per node
      node001 (GPU 0): 91.6%
      node001 (GPU 1): 92.2%
      node001 (GPU 2): 92.0%
      node001 (GPU 3): 92.0%

  GPU memory usage per node - maximum used/total
      node001 (GPU 0): 79.5GB/79.6GB (99.8%)
      node001 (GPU 1): 79.5GB/79.6GB (99.8%)
      node001 (GPU 2): 79.5GB/79.6GB (99.8%)
      node001 (GPU 3): 79.5GB/79.6GB (99.8%)

                                     Notes
================================================================================
  * Example job statistics output.
seff:
seff is a lightweight tool that summarizes efficiency data from the Slurm accounting database. It is most useful once a job has finished.
# Replace 1234567 with your JobID
seff 1234567
Example output:
Job ID: 1234567
Cluster: cluster
User/Group: user1/research
State: RUNNING
Nodes: 1
Cores per node: 86
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 96-13:11:54 core-walltime
Job Wall-clock time: 1-02:56:39
Memory Utilized: 0.00 MB
Memory Efficiency: 0.00% of 1007.81 GB (11.72 GB/core)
WARNING: Efficiency statistics can only be obtained after the job has ended as seff is based on accounting database data.
How to Improve GPU Utilization
Think of each iteration as: (1) copy CPU→GPU, (2) run GPU kernels, (3) copy GPU→CPU. Utilization suffers when the GPU is starved for data or when kernels don’t exploit parallelism.
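To make the three-phase picture concrete, the following back-of-the-envelope sketch estimates what fraction of an iteration the GPU spends running kernels. The function name and the timings are hypothetical, and the `overlap=True` case assumes copies are perfectly hidden behind compute, which is an idealization rather than something any framework guarantees.

```python
def gpu_busy_fraction(h2d_ms, kernel_ms, d2h_ms, overlap=False):
    """Fraction of one iteration during which GPU kernels are running.

    With overlap=False the three phases run back to back. With overlap=True
    the copies are assumed to be fully hidden behind compute (idealized), so
    the iteration takes as long as its slowest phase.
    """
    if overlap:
        iteration_ms = max(h2d_ms, kernel_ms, d2h_ms)
    else:
        iteration_ms = h2d_ms + kernel_ms + d2h_ms
    return kernel_ms / iteration_ms


# Hypothetical timings: 30 ms copy in, 10 ms of kernels, 10 ms copy out.
serial = gpu_busy_fraction(30, 10, 10)                 # 10 / 50 = 0.2
hidden = gpu_busy_fraction(30, 10, 10, overlap=True)   # 10 / 30, about 0.33
```

In this made-up case the input copy dominates even with perfect overlap, which suggests speeding up data delivery before tuning the kernels themselves.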
Practical remedies:
Feed the GPU faster
Use multi-worker data loaders (e.g., num_workers in PyTorch).
Stage data to high-performance storage (e.g., <your-scratch> or local SSD) instead of home or project space.
Avoid small, frequent I/O; prefer fewer, larger reads/writes.
Tune the workload
Increase batch size (within memory limits) to amortize overhead.
Use vendor-optimized libraries (cuDNN, cuBLAS, NCCL).
Pin memory for host→device transfers when supported.
Right-size the hardware
Verify one GPU is well-utilized before scaling to multiple.
If your job uses a tiny working set or short kernels, a smaller slice (e.g., MIG) may outperform a full A100/H100/H200 for cost and queue time.
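As an illustration of the data-loading and pinned-memory remedies above, here is a minimal sketch that derives PyTorch `DataLoader` keyword arguments from the Slurm allocation. The helper `loader_settings` is hypothetical, and reserving one core for the main process and a prefetch factor of 2 are assumptions to tune for your workload. `SLURM_CPUS_PER_TASK` is only set when the job requests `--cpus-per-task`, so the sketch falls back to 1.

```python
import os


def loader_settings(batch_size, environ=os.environ):
    """Build keyword arguments for torch.utils.data.DataLoader."""
    cpus = int(environ.get("SLURM_CPUS_PER_TASK", "1"))
    # Assumption: leave one core for the main training process.
    workers = max(cpus - 1, 0)
    settings = {
        "batch_size": batch_size,
        "num_workers": workers,
        # Page-locked host buffers speed up host-to-device copies.
        "pin_memory": True,
    }
    if workers > 0:
        # DataLoader rejects these two options when num_workers is 0.
        settings["prefetch_factor"] = 2        # batches queued ahead per worker
        settings["persistent_workers"] = True  # keep workers alive between epochs
    return settings
```

With PyTorch installed, the result can be passed straight through, for example `DataLoader(my_dataset, **loader_settings(256))`, where `my_dataset` stands in for your own dataset object.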
Zero GPU Utilization (0%)
Common causes and fixes:
Non-GPU code path: confirm your software is GPU-enabled and actually using CUDA (or ROCm, if applicable). Many tools fall back to CPU silently.
Environment not set up: ensure the correct CUDA toolkit and drivers are in use; match major versions to the node driver. Modern accelerators often require CUDA 12+.
Interactive hoarding: avoid long salloc sessions holding idle GPUs. For interactive exploration, consider smaller GPU slices (e.g., MIG) if offered.
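A quick preflight check at the top of a job can catch the first two causes before hours of CPU-only work go by. The sketch below is one possible approach, not a site-provided tool: the function names are hypothetical, and it assumes the scheduler exports `CUDA_VISIBLE_DEVICES` for GPU jobs, which is common with Slurm but depends on how the cluster is configured. The PyTorch check is skipped when PyTorch is not installed.

```python
import importlib.util
import os
import shutil


def gpu_preflight(environ=os.environ):
    """Return a list of warnings; an empty list means no obvious problem."""
    warnings = []
    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        warnings.append("CUDA_VISIBLE_DEVICES is unset: was a GPU requested?")
    elif visible.strip() in ("", "-1"):
        warnings.append("CUDA_VISIBLE_DEVICES hides all GPUs from CUDA programs.")
    if shutil.which("nvidia-smi") is None:
        warnings.append("nvidia-smi not found on PATH: is this a GPU node?")
    return warnings


def torch_sees_gpu():
    """True or False if PyTorch is installed, None if it is not."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.cuda.is_available()
```

If `torch_sees_gpu()` returns False on a GPU node, the usual suspects are a CPU-only PyTorch build or a CUDA version that does not match the node driver.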
Low GPU Utilization (< ~15–30%)
Investigate and try:
Application/script configuration: double-check command-line flags and config files.
Data loader parallelism: increase CPU workers and prefetching.
Too many GPUs: do a scaling sweep (1, 2, 4 GPUs) and pick the knee of the curve.
Storage choice: write active job output to high-performance scratch (e.g., /scratch/$PI/). Avoid home paths during training.
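For the scaling sweep, one simple way to locate the knee is to compute parallel efficiency, meaning throughput on n GPUs divided by n times the single-GPU throughput, and keep the largest GPU count that stays above a threshold. The sketch below assumes you have measured throughput yourself; the function name, the 0.8 threshold, and the sample numbers are all illustrative.

```python
def pick_gpu_count(throughput, min_efficiency=0.8):
    """Pick the largest GPU count whose parallel efficiency meets a threshold.

    throughput maps GPU count -> measured samples/second and must include
    an entry for 1 GPU. Efficiency is throughput[n] / (n * throughput[1]).
    """
    baseline = throughput[1]
    best = 1
    for n in sorted(throughput):
        efficiency = throughput[n] / (n * baseline)
        if efficiency >= min_efficiency:
            best = n
    return best


# Hypothetical sweep: efficiencies are 1.00, 0.95, and 0.65.
measured = {1: 100.0, 2: 190.0, 4: 260.0}
choice = pick_gpu_count(measured)  # 2
```

Here four GPUs deliver only 2.6 times the single-GPU throughput, so requesting two is the better trade between speed and allocation cost.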
Common Mistakes
Requesting GPUs for a CPU-only application.
Assuming multi-GPU works automatically. Many frameworks require explicit multi-GPU code.
Over-requesting resources “just in case.” Slurm fairshare/priority will reflect the requested resources, not only what your code actually used.
Build Your Skills
Helpful starting points:
Vendor tools and docs: CUDA Toolkit, cuDNN, Nsight Systems/Compute, NCCL.
Framework profilers: PyTorch Profiler (traces viewable in TensorBoard), TensorFlow Profiler.
Getting Help
Open a support ticket by emailing help@arch.jhu.edu and include:
* JobID(s), Slurm script, module list, and a short description.
* A brief profiler report (nsys or ncu) if available.
Adapted and expanded from community best practices. Portions inspired by Princeton Research Computing’s GPU documentation: `GPU Computing (Low Utilization) <https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing#low-util>`_.