GPU Jobs

Rockfish provides several partitions equipped with NVIDIA GPUs for compute-intensive workloads. This page outlines available GPU partitions, submission requirements, usage policies, and tools for tracking GPU utilization.

Available GPU Partitions

a100

A100 is designed for GPU-enabled workflows on 40 GB A100 cards.

  • GPUs: 4× NVIDIA A100 (40 GB each)

  • Requirements:

    • Slurm allocation: <PI_NAME>_gpu (e.g., jsmith123_gpu)

    • QoS: qos_gpu

  • Max Runtime: 3 days

ica100

ICA100 is for GPU workflows that need the larger 80 GB A100 cards.

  • GPUs: 4× NVIDIA A100 (80 GB each)

  • Requirements:

    • Slurm allocation: <PI_NAME>_gpu

    • QoS: qos_gpu

  • Max Runtime: 3 days

mig_class

MIG_CLASS provides GPUs for classroom use. It uses Multi-Instance GPU (MIG) mode to create isolated GPU slices for student jobs.

  • GPUs: 4× NVIDIA A100 (80 GB each), partitioned into 12× 20 GB MIG instances

  • Requirements:

    • Slurm allocation: <class_name>-<PI_NAME> (e.g., cs601-jsmith123)

    • QoS: mig_class

  • Max Runtime: 1 day

l40s

L40s is intended for workflows that benefit from 48 GB of memory per GPU and up to eight GPUs per node.

  • GPUs: 8× NVIDIA L40s (48 GB each)

  • Requirements:

    • Slurm allocation: <PI_NAME>_gpu

    • QoS: qos_gpu

  • Max Runtime: 1 day
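
The account and QoS pairings listed for each partition above can be collected in a small shell helper. This is only a sketch: gpu_flags is a hypothetical function name, not a Rockfish-provided tool, and it simply echoes the flags implied by the requirements on this page.

```shell
#!/bin/bash
# Sketch: print the sbatch flags that match each partition's requirements.
# gpu_flags is a hypothetical helper, not a tool installed on Rockfish.
# Usage: gpu_flags <partition> <PI_NAME> [class_name]
gpu_flags() {
    local partition="$1"
    local pi="$2"
    local class="${3:-}"
    case "$partition" in
        a100|ica100|l40s)
            # These partitions use the <PI_NAME>_gpu allocation with qos_gpu.
            echo "--partition=$partition --account=${pi}_gpu --qos=qos_gpu"
            ;;
        mig_class)
            # Classroom jobs use <class_name>-<PI_NAME> with the mig_class QoS.
            if [ -z "$class" ]; then
                echo "mig_class needs a class name" >&2
                return 1
            fi
            echo "--partition=mig_class --account=${class}-${pi} --qos=mig_class"
            ;;
        *)
            echo "unknown GPU partition: $partition" >&2
            return 1
            ;;
    esac
}
```

For example, gpu_flags a100 jsmith123 prints the partition, account, and QoS flags for the a100 partition, which could then be passed to sbatch on the command line.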

Access Requirements

GPU partitions are not accessible by default. To gain access, PIs must:

  1. Request a GPU allocation by contacting the Rockfish support team.

  2. Be assigned to a project-specific Slurm account ending in _gpu (e.g., jsmith123_gpu).

  3. Submit jobs using that account and the corresponding QoS (e.g., qos_gpu).
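
To see whether step 2 has already happened for your user, one option is to list your Slurm associations and keep only the account names ending in _gpu. The sketch below assumes the accounts arrive one per line; gpu_accounts is a hypothetical helper name, and the sacctmgr invocation in the comment is a suggested way to produce that input, not a documented Rockfish procedure.

```shell
#!/bin/bash
# Sketch: keep only Slurm account names that end in _gpu (one name per line).
# gpu_accounts is a hypothetical helper name.
gpu_accounts() {
    # awk prints matching lines; sort -u removes duplicates that appear
    # when the same account is associated with several partitions.
    awk '/_gpu$/' | sort -u
}

# Assumed usage on a login node (requires sacctmgr):
#   sacctmgr -nP show assoc user="$USER" format=account | gpu_accounts
```

An empty result suggests that no _gpu account is associated with your user yet.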

GPU Usage Limits

The qos_gpu configuration enforces a strict usage limit:

MaxTRESPA: gres/gpu=10

No more than 10 GPUs can be in use at once per account, across all partitions and jobs. A job that would push the account past this limit remains pending with the reason:

(QOSMaxGRESPerAccount)
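
The arithmetic behind this limit is simple enough to check before submitting. The sketch below is illustrative only: gpu_request_fits is a hypothetical helper, GPU_CAP mirrors the gres/gpu=10 value above, and the number of GPUs already in use by the account has to be supplied by you.

```shell
#!/bin/bash
# Sketch: check a GPU request against the per-account cap.
# GPU_CAP mirrors MaxTRESPA gres/gpu=10; gpu_request_fits is hypothetical.
GPU_CAP=10

# Usage: gpu_request_fits <gpus_in_use_by_account> <gpus_requested>
gpu_request_fits() {
    local in_use="$1"
    local requested="$2"
    local total=$(( in_use + requested ))
    if [ "$total" -le "$GPU_CAP" ]; then
        echo "fits: $total of $GPU_CAP GPUs in use after start"
    else
        echo "pending: $total would exceed $GPU_CAP (QOSMaxGRESPerAccount)"
        return 1
    fi
}
```

For instance, with 8 GPUs already in use, a 2-GPU job fits exactly, while a 4-GPU job would wait until enough running jobs finish.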

To list the jobs currently running or pending under qos_gpu, and so gauge how many GPUs each account is using, administrators may use:

squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.6C %R" --qos=qos_gpu

GPU Job Submission Example

Submit a basic batch job requesting two GPUs:

#!/bin/bash
#SBATCH --partition=a100
#SBATCH --qos=qos_gpu
#SBATCH --account=jsmith123_gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=6
#SBATCH --time=24:00:00

Check assigned GPU devices:

echo $CUDA_VISIBLE_DEVICES
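
Inside a job script it can be useful to confirm that the number of visible devices matches the request. The sketch below counts the comma-separated entries in CUDA_VISIBLE_DEVICES; count_visible_gpus is a hypothetical helper name, and it assumes the variable holds a plain comma-separated list of device indices or UUIDs.

```shell
#!/bin/bash
# Sketch: count the devices listed in CUDA_VISIBLE_DEVICES.
# count_visible_gpus is a hypothetical helper name.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        # Unset or empty: no GPUs are visible to this process.
        echo 0
        return 0
    fi
    # The number of entries is the number of commas plus one.
    local commas
    commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
    echo $(( commas + 1 ))
}
```

With the --gres=gpu:2 request above, the function would be expected to report 2 on the allocated node.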

Monitoring GPU Usage with jobstats

Rockfish provides the jobstats tool to evaluate GPU, CPU, and memory usage for completed and running jobs.

Basic usage:

jobstats <jobid>

Output includes:

  • GPU utilization over job duration

  • Memory used per GPU

  • Node assignments

  • Efficiency metrics

For more on viewing job status and resource usage, visit: Viewing Job Status & Efficiency

Helpful Commands

  • View available GPU partitions:

    sinfo -p a100,ica100,l40s,mig_class
    

Additional Tips

  • Avoid requesting more GPUs than necessary; larger requests may wait longer in the queue.

  • Always confirm that your Slurm account and QoS match the partition.

  • Use interact with --gres=gpu:<N> to start a live GPU session.

Note

If you’re unsure whether your PI has GPU access, or you encounter errors submitting GPU jobs, please open a ticket on the Support page or contact the Rockfish administrators directly.