Viewing Job Status & Efficiency

sqme

View all of your queued and running jobs (sqme is a custom wrapper around squeue):

$ sqme
USER   ACCOUNT      JOBID   PARTITION  NAME       NODES  CPUS  MIN_MEMORY  TIME_LIMIT  TIME     NODELIST  ST  REASON
user   group_gpu    111111  a100       job1.sh    1      12    4000M       3-00:00:00  3:53:46  gpu14     R   None
user   group_gpu    111112  a100       job2.sh    1      12    4000M       3-00:00:00  3:09:00  gpu13     R   None
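
Because sqme is a site-specific wrapper, its definition is not shown here. The column headers above match squeue's default headers for the format specifiers used below, so on a cluster without sqme a rough stand-in might look like the following sketch. The names sqme_like and SQME_LIKE_FORMAT are hypothetical, and the real wrapper may be defined differently.

```shell
# Hypothetical stand-in for sqme; the real wrapper may be defined differently.
# Specifiers, in order: user, account, job ID, partition, name, nodes, CPUs,
# minimum memory, time limit, time used, node list, compact state, reason.
SQME_LIKE_FORMAT='%u %a %i %P %j %D %C %m %l %M %N %t %r'

sqme_like() {
    # --me restricts the listing to your own jobs; extra arguments pass through.
    squeue --me --format="$SQME_LIKE_FORMAT" "$@"
}
```

Calling sqme_like --states=PD would then list only pending jobs. On older Slurm releases that lack --me, --user="$USER" should serve the same purpose.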

Common Pending Reasons

When a job is in the PENDING (PD) state, Slurm reports a reason explaining why it has not started yet. The reason appears in the REASON column of the sqme output:

$ sqme

Example output:

JOBID            PARTITION  NAME      USER     ST  TIME      NODES  CPUS  REASON
500001           parallel   sim01     user01   PD  0:00         1     1    (MaxCpuPerAccount)
500002           parallel   sim02     user01   PD  0:00         1     1    (MaxCpuPerAccount)
500003           parallel   jobXYZ    user02   PD  0:00         1     1    (AssocGrpCPUMinutesLimit)
500004_[1-5]     parallel   arrayjob  user03   PD  0:00         1     1    (AssocGrpCPUMinutesLimit)
500009           parallel   depend    user05   PD  0:00         1     1    (Dependency)

Reason Codes:

  • None: No assigned reason yet.

  • Priority: Job is waiting due to other jobs with higher priority.

  • Dependency: Job is waiting on another job to complete.

  • JobArrayTaskLimit: An array job hit its concurrency limit.

  • MaxCpuPerAccount: Your group has reached its allowed CPU limit.

  • AssocGrpCPUMinutesLimit: Your group has used up its allowed CPU-minutes.

  • QOSMaxGRESPerUser: Requested GPU resources exceed QoS allowance.

  • MaxGRESPerAccount/User: Max GPU resources exceeded for the group or user.

For a full list of reason codes, see the official documentation: https://slurm.schedmd.com/job_reason_codes.html
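
When many jobs are pending, it can help to tally them by reason. The sketch below reads squeue-style output on stdin, locates the ST and REASON columns from the header row, and counts pending jobs per reason. The function name count_pending_reasons is hypothetical, and the sketch assumes whitespace-separated columns with no spaces inside job names.

```shell
# Tally pending (PD) jobs by reason from squeue-style output on stdin.
# Assumes a header row containing ST and REASON, and no spaces in any field.
count_pending_reasons() {
    awk '
        NR == 1 {
            # Find the ST and REASON columns from the header.
            for (i = 1; i <= NF; i++) {
                if ($i == "ST") st = i
                if ($i == "REASON") reason = i
            }
            next
        }
        st && reason && $st == "PD" {
            r = $reason
            gsub(/[()]/, "", r)   # drop the parentheses around the reason
            count[r]++
        }
        END {
            for (r in count) printf "%d %s\n", count[r], r
        }
    ' | sort -k1,1nr -k2,2
}
```

Typical use would be sqme | count_pending_reasons, which prints one line per reason, most frequent first.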

scontrol show job

View detailed job info:

$ scontrol show job 1111111

JobId=1111111 JobName=job_script.sh
   UserId=example_user GroupId=example_group
   Priority=20688 QOS=qos_gpu State=RUNNING Reason=None
   RunTime=03:55:39 TimeLimit=3-00:00:00
   Partition=a100 NodeList=gpu14 NumCPUs=12
   ReqTRES=cpu=1,mem=4000M,node=1,billing=12,gres/gpu=1
   AllocTRES=cpu=12,mem=48000M,node=1,billing=12,gres/gpu=1
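
The output is a series of Key=value pairs, so a single field can be pulled out in a script. The following is a minimal sketch; the function name job_field is hypothetical, and it assumes the value contains no spaces (a job name with spaces would be cut short).

```shell
# Print the value of one Key=value field from `scontrol show job` output on stdin.
job_field() {
    key=$1
    # Put each Key=value pair on its own line, then match the key exactly.
    tr -s ' \t' '\n\n' | awk -v k="$key" '
        !found && index($0, k "=") == 1 {
            # Keep everything after the first "=", since values such as
            # ReqTRES contain further "=" characters.
            print substr($0, length(k) + 2)
            found = 1
        }
    '
}
```

For example, scontrol show job 1111111 | job_field Reason would print the pending reason, and job_field AllocTRES the allocated resources.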

sacct

View historical job data:

$ sacct

JobID      JobName    Partition  State     ExitCode
111111     job1.sh    a100       TIMEOUT   0:0
111111.0   python     a100       COMPLETED 0:0
111112     job2.sh    a100       RUNNING   0:0
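
With no arguments, sacct normally reports only jobs from the current day, so a wider time window and an explicit field list are often useful. The sketch below filters machine-readable sacct output for jobs that ended in a state other than COMPLETED. The function name unsuccessful_jobs is hypothetical, and it assumes the three fields JobID, State, and ExitCode in that order.

```shell
# Read lines produced by
#   sacct --parsable2 --noheader --format=JobID,State,ExitCode
# on stdin and print the entries that finished without completing.
unsuccessful_jobs() {
    awk -F'|' '
        # Skip jobs that succeeded or have not finished yet.
        $2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" {
            printf "%s %s %s\n", $1, $2, $3
        }
    '
}
```

One way to use it for the past week, one line per job rather than per step:

sacct --starttime=now-7days --allocations --parsable2 --noheader --format=JobID,State,ExitCode | unsuccessful_jobs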

seff

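View job efficiency (most reliable once the job has finished; for a job that is still running, the figures can be zero or misleading):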

$ seff 111111

Job ID: 111111
CPU Utilized: 00:00:00
CPU Efficiency: 0.00%
Memory Utilized: 0.00 MB
Memory Efficiency: 0.00%

reportseff

Summarize time, CPU, and memory efficiency in a single table:

$ reportseff 111111

JobID   State      Elapsed  TimeEff   CPUEff   MemEff
111111  RUNNING    03:57:40   5.5%      ---      ---
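
The TimeEff value in the example is consistent with elapsed time divided by the job's time limit (03:57:40 out of 3-00:00:00). The sketch below reproduces that arithmetic, assuming this definition of TimeEff. The function names slurm_to_seconds and time_eff are hypothetical, and only the D-HH:MM:SS, HH:MM:SS, and MM:SS forms are handled.

```shell
# Convert a Slurm duration (D-HH:MM:SS, HH:MM:SS, or MM:SS) to seconds.
slurm_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             exit 1   # other forms are not handled in this sketch
        print days * 86400 + secs
    }'
}

# Print elapsed time as a percentage of the time limit.
time_eff() {
    elapsed=$(slurm_to_seconds "$1") || return 1
    limit=$(slurm_to_seconds "$2") || return 1
    [ "$limit" -gt 0 ] || return 1
    awk -v e="$elapsed" -v l="$limit" 'BEGIN { printf "%.1f%%\n", 100 * e / l }'
}
```

Running time_eff 03:57:40 3-00:00:00 prints 5.5%, matching the table above. A low TimeEff on a finished job suggests the requested time limit could be reduced.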

jobstats

Note: We use jobstats, an open-source utility developed at Princeton University, to collect and display CPU, memory, and GPU utilization for Slurm jobs. It gives an at-a-glance summary of resource efficiency and is particularly helpful for GPU workflows.

Visualize GPU, memory, and CPU usage:

$ jobstats 1111111

================================================================================
                              Slurm Job Statistics
================================================================================
           Job ID: 1111111
    NetID/Account: example_user/example_group_gpu
         Job Name: job_script
            State: RUNNING
            Nodes: 1
        CPU Cores: 12
  GPU utilization: 93%
 GPU memory usage: 31%