GPU Jobs

Rockfish provides several partitions equipped with NVIDIA GPUs for compute-intensive workloads. This page outlines available GPU partitions, submission requirements, usage policies, and tools for tracking GPU utilization.

Available GPU Partitions

a100

A100 is designed for GPU-enabled workflows on 40 GB A100 cards.

GPUs: 4× NVIDIA A100 (40 GB each)
Requirements:
- Slurm allocation: <PI_NAME>_gpu (e.g., jsmith123_gpu)
- QoS: qos_gpu
Max Runtime: 3 days

ica100

ICA100 is for GPU workflows on upgraded A100 hardware.

GPUs: 4× NVIDIA A100 (80 GB each)
Requirements:
- Slurm allocation: <PI_NAME>_gpu
- QoS: qos_gpu
Max Runtime: 3 days

mig_class

MIG_CLASS provides GPUs for classroom use. It uses Multi-Instance GPU (MIG) mode to create isolated GPU slices for student jobs.

GPUs: 4× NVIDIA A100 (80 GB each), segmented into 12× 20 GB MIGs
Requirements:
- Slurm allocation: <class_name>-<PI_NAME> (e.g., cs601-jsmith123)
- QoS: mig_class
Max Runtime: 1 day

l40s

L40s is intended for high-performance workflows that benefit from large GPU memory and performance.

GPUs: 8× NVIDIA L40s (48 GB each)
Requirements:
- Slurm allocation: <PI_NAME>_gpu
- QoS: qos_gpu
Max Runtime: 1 day

Access Requirements

By default, GPU partitions are not accessible to all users. To gain access, PIs must:

Request a GPU allocation by contacting the Rockfish support team.
Be assigned to a project-specific Slurm account ending in _gpu (e.g., jsmith123_gpu).
Submit jobs using that account and the corresponding QoS (e.g., qos_gpu).

GPU Usage Limits

The qos_gpu configuration enforces a strict usage limit:

MaxTRESPA: gres/gpu=10

This means that no more than 10 GPUs can be in use at once per account, regardless of partition or job size. If your job exceeds the limit, it will remain pending with the reason:

(QOSMaxGRESPerAccount)

To check the current GPU usage per account, administrators may use:

squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.6C %R" --qos=qos_gpu

GPU Job Submission Example

Submit a basic batch job requesting two GPUs:

#SBATCH --partition=a100
#SBATCH --qos=qos_gpu
#SBATCH --account=jsmith123_gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=6
#SBATCH --time=24:00:00

Check assigned GPU devices:

echo $CUDA_VISIBLE_DEVICES

Monitoring GPU Usage with jobstats

Rockfish provides the jobstats tool to evaluate GPU, CPU, and memory usage for completed and running jobs.

Basic usage:

jobstats <jobid>

Output includes:

GPU utilization over job duration
Memory used per GPU
Node assignments
Efficiency metrics

For more on viewing job status and resource usage, visit: Viewing Job Status & Efficiency

Helpful Commands

View available GPU partitions:
```
sinfo -p a100,ica100,l40s,mig_class
```

Additional Tips

Avoid requesting more GPUs than necessary — this may increase wait time.
Always confirm that your Slurm account and QoS match the partition.
Use interact with –gres=gpu:<N> to start a live GPU session.

Note

If you’re unsure whether your PI has GPU access, or you encounter errors submitting GPU jobs, please open a ticket on the Support page or contact the Rockfish administrators directly.