Viewing Job Status & Efficiency

sqme

View all jobs for a user. sqme is a custom, site-specific wrapper around squeue:

$ sqme
USER   ACCOUNT      JOBID   PARTITION  NAME       NODES  CPUS  MIN_MEMORY  TIME_LIMIT  TIME     NODELIST  ST  REASON
user   group_gpu    111111  a100       job1.sh    1      12    4000M       3-00:00:00  3:53:46  gpu14     R   None
user   group_gpu    111112  a100       job2.sh    1      12    4000M       3-00:00:00  3:09:00  gpu13     R   None
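
Because sqme is a site-specific wrapper, its implementation is not shown here. On a system without it, a rough equivalent can be sketched with plain squeue. The function names sqme_format and my_queue and the column widths below are illustrative assumptions; the %-codes are standard squeue format specifiers.

```shell
# Hypothetical stand-in for sqme; the real wrapper may differ.
# Returns a squeue format string whose columns mirror the sqme output above.
sqme_format() {
    # %u user, %a account, %i job ID, %P partition, %j name, %D nodes,
    # %C CPUs, %m min memory, %l time limit, %M time used, %N node list,
    # %t compact state, %r reason
    printf '%s' '%.8u %.12a %.10i %.10P %.12j %.5D %.4C %.10m %.11l %.10M %.10N %.2t %r'
}

# List jobs for the given user (defaults to the current user).
my_queue() {
    squeue --user="${1:-$USER}" --format="$(sqme_format)"
}
```

Running my_queue with no argument lists your own jobs; my_queue userA lists another user's jobs.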

Common Pending Reasons

When a job is in the PENDING (PD) state, Slurm records a reason that explains why it has not started yet. You can view it in the REASON column of the sqme output:

$ sqme

Example output:

JOBID    PARTITION  NAME                 USER     ST   TIME NODES REASON
-------- ---------- -------------------- -------- -- ------- ----- -----------------------
100001   a100       train                userA    PD   0:00     1 Dependency
100002   a100       batch_job            userB    PD   0:00     1 Priority
100003   a100       workflow_11          userC    PD   0:00     1 Resources
100004   a100       analysis_22          userC    PD   0:00     1 Priority
100005   l40s       preproc_01           userD    PD   0:00     1 Resources
100006   l40s       model_fit            userD    PD   0:00     1 Priority
100011   h100       training_run         userE    PD   0:00     1 QOSMaxGRESPerUser
100012   h100       inference            userE    PD   0:00     1 QOSMaxGRESPerUser
100015   h100       gpu_test             userF    PD   0:00     1 Dependency

Reason Codes:

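  • None: No reason has been assigned yet.

  • Priority: One or more higher-priority jobs are ahead of this job in the queue.

  • Resources: The requested resources are not currently available, for example because other jobs are using them.

  • Dependency: The job depends on another job, and that dependency has not been satisfied yet.

  • JobArrayTaskLimit: The job array has reached its limit on simultaneously running tasks.

  • MaxCpuPerAccount: The job's CPU request would exceed the maximum allowed for your group's account.

  • AssocGrpCPUMinutesLimit: Your group has reached its aggregate limit on CPU minutes.

  • QOSMaxGRESPerUser: The job's GPU (GRES) request would exceed the per-user maximum set by the QOS.

  • MaxGRESPerAccount/User: The GPU (GRES) limit for the group's account or for the user would be exceeded.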

For a full list of reason codes, see the official documentation: https://slurm.schedmd.com/job_reason_codes.html
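
To see at a glance which reasons are holding jobs back, the pending jobs can be tallied by reason. The sketch below is a minimal example, not part of sqme; the helper name count_pending_reasons is hypothetical, and it assumes input lines of the form "STATE REASON", as produced by squeue with the format string '%t %r'.

```shell
# Count pending jobs per reason.
# Reads "STATE REASON" lines on stdin and prints "COUNT REASON",
# most frequent reason first. Lines whose state is not PD are ignored.
count_pending_reasons() {
    awk '$1 == "PD" {
             r = $0
             sub(/^[^ ]+ +/, "", r)   # drop the state column, keep the reason
             n[r]++
         }
         END { for (r in n) print n[r], r }' | sort -k1,1nr -k2,2
}

# Typical use on a cluster (assumes squeue is on PATH):
#   squeue --states=PENDING --noheader --format='%t %r' | count_pending_reasons
```

Applied to the example output above, this would report three jobs waiting on Priority and two each on Dependency, QOSMaxGRESPerUser, and Resources.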

scontrol show job

View detailed information about a single job:

$ scontrol show job 100123
JobId=100123 JobName=my_job
   UserId=userX(0000) GroupId=research(0000) MCS_label=N/A
   Priority=4000000000 Nice=0 Account=pi_group QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:57:06 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2025-04-28T10:00:00 EligibleTime=2025-04-28T10:00:00
   AccrueTime=2025-04-28T10:00:00
   StartTime=2025-04-28T10:00:15 EndTime=2025-05-01T10:00:15 Deadline=N/A
   PreemptEligibleTime=2025-04-28T10:00:15 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-28T10:00:15 Scheduler=Backfill
   Partition=gpuA100 AllocNode:Sid=login01:123456
   ReqNodeList=(null) ExcNodeList=nodeX
   NodeList=gpu001
   BatchHost=gpu001
   NumNodes=1 NumCPUs=12 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=10G,node=1,billing=180,gres/gpu=1
   AllocTRES=cpu=12,mem=120G,node=1,billing=180,gres/gpu=1,gres/gpu:a100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/scratch/pi_group/userX
   Power=
   CpusPerTres=gpu:12
   TresPerNode=gres:gpu:1
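
The scontrol output is a long list of Key=Value pairs, so it is often easier to pull out a single field than to read the whole record. The helper below is a minimal sketch; the name job_field is hypothetical, and it assumes the value contains no spaces (true for the fields in the sample above, but not necessarily for every job name or command).

```shell
# Print the value of one Key=Value field from `scontrol show job` output.
# Reads the scontrol output on stdin; $1 is the key to look up.
# Assumption: the value itself contains no whitespace.
job_field() {
    tr -s '[:space:]' '\n' |
        awk -F= -v key="$1" '!found && $1 == key {
            sub(/^[^=]*=/, "")   # strip "Key=", keep any "=" inside the value
            print
            found = 1
        }'
}

# Typical use on a cluster:
#   scontrol show job 100123 | job_field JobState
#   scontrol show job 100123 | job_field AllocTRES
```

In the sample above, ReqTRES lists cpu=1 and mem=10G while AllocTRES lists cpu=12 and mem=120G, so extracting both fields is a quick way to compare requested and allocated resources.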

sacct

View accounting data for current and past jobs. With no arguments, sacct lists your jobs since midnight of the current day:

$ sacct

JobID      JobName    Partition  State     ExitCode
111111     job1.sh    a100       TIMEOUT   0:0
111111.0   python     a100       COMPLETED 0:0
111112     job2.sh    a100       RUNNING   0:0
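
When checking many jobs, it can help to filter the sacct records down to those that did not finish cleanly. The sketch below assumes pipe-delimited input with the columns JobID, State, and ExitCode, as produced by sacct with --parsable2; the helper name failed_jobs is hypothetical.

```shell
# Print "JobID State ExitCode" for every record whose state is not
# COMPLETED, RUNNING, or PENDING.
# Reads pipe-delimited lines (JobID|State|ExitCode) on stdin.
failed_jobs() {
    awk -F'|' '$2 !~ /^(COMPLETED|RUNNING|PENDING)$/ { print $1, $2, $3 }'
}

# Typical use on a cluster:
#   sacct --jobs=111111,111112 --parsable2 --noheader \
#         --format=JobID,State,ExitCode | failed_jobs
```

For the sample records above, only job 111111 (TIMEOUT) would be printed.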

seff

View CPU and memory efficiency for a single job:

$ seff 111111

Job ID: 111111
CPU Utilized: 00:00:00
CPU Efficiency: 0.00%
Memory Utilized: 0.00 MB
Memory Efficiency: 0.00%

reportseff

Summarize time, CPU, and memory efficiency for a job in a single table:

$ reportseff 111111

JobID   State    Elapsed   TimeEff  CPUEff  MemEff
111111  RUNNING  03:57:40  5.5%     ---     ---
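
The TimeEff value can be checked by hand, assuming it is the elapsed time divided by the job's time limit. That assumption matches the sample: job 111111 has a limit of 3-00:00:00 in the sqme output, and 03:57:40 out of three days is about 5.5%. The helper names slurm_seconds and time_eff below are hypothetical.

```shell
# Convert a Slurm duration ([D-]HH:MM:SS or MM:SS) to seconds.
slurm_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        if (index(t, "-")) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3) s = p[1] * 3600 + p[2] * 60 + p[3]
        else        s = p[1] * 60 + p[2]
        print days * 86400 + s
    }'
}

# Print elapsed time as a percentage of the time limit.
# $1 = elapsed, $2 = time limit, both in Slurm duration format.
time_eff() {
    awk -v e="$(slurm_seconds "$1")" -v l="$(slurm_seconds "$2")" \
        'BEGIN { printf "%.1f%%\n", 100 * e / l }'
}

# Example with the values from the sample output:
#   time_eff 03:57:40 3-00:00:00
```

A low TimeEff on a finished job suggests the requested time limit was much longer than needed; a shorter limit can make the job easier to schedule.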

jobstats

Note: We use jobstats, an open-source utility developed by Princeton University, to collect and visualize CPU, memory, and GPU utilization for Slurm jobs. It provides an intuitive, at-a-glance summary of resource efficiency and is particularly helpful for GPU workflows.

Visualize GPU, memory, and CPU usage:

$ jobstats 1111111

================================================================================
                           Slurm Job Statistics
================================================================================
            Job ID: 1111111
     NetID/Account: example_user/example_group_gpu
          Job Name: job_script
             State: RUNNING
             Nodes: 1
         CPU Cores: 12
   GPU utilization: 93%
  GPU memory usage: 31%