This article describes basic Slurm usage. Brief "how-to" topics include, in this order:

- A simple Slurm job script
- Submit the job
- List jobs
- Get job details
- Suspend a job (root only)
- Resume a job (root only)
- Kill a job
- Hold a job
- Release a job
- List partitions
- Submit a job that's dependent on a prerequisite job being completed
Simple Slurm job script:
$ cat slurm-job.sh
#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set max wallclock time
#SBATCH --time=100:00:00

# set name of job
#SBATCH --job-name=MyTestjob5

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=user@zamren.com

# run the application
echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this in a file" > analysis.output
sleep 60
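Note that Slurm only honors #SBATCH directives that start their own line in the comment block at the top of the script; a directive appended to another comment, or placed after the first executable line, is silently ignored. A minimal sanity check, using an illustrative trimmed script written to /tmp:

```shell
# Write a trimmed example job script (illustrative file name), then count
# lines that Slurm would actually parse as directives: those beginning
# with "#SBATCH" at the start of the line.
cat > /tmp/example-job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=100:00:00
#SBATCH --job-name=MyTestjob5
sleep 60
EOF
grep -c '^#SBATCH' /tmp/example-job.sh
```

If a directive had been fused onto a comment line, this count would come up short.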
Submit the job:
$ sbatch slurm-job.sh
Submitted batch job 106
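In scripts it is often handy to capture the job ID rather than the full "Submitted batch job" message; sbatch's --parsable option prints only the ID (or "ID;clustername" in multi-cluster setups). A sketch, with the sbatch call replaced by the ID from the example above:

```shell
# On a cluster: out=$(sbatch --parsable slurm-job.sh)
out="106"
jobid="${out%%;*}"   # strip any ";cluster" suffix that --parsable may append
echo "$jobid"
```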
List jobs:
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  106      defq slurm-jo  rstober  R       0:04      1 atom01
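By default squeue lists every job on the cluster; on a busy system you usually want only your own, e.g. squeue -u "$USER" -t RUNNING. The same state filter can also be applied to saved output with awk (the sample rows below are illustrative):

```shell
# Keep only rows whose ST column (5th field) is "R", printing the job ID.
out='106 defq slurm-jo rstober R  0:04 1 atom01
107 defq other    alice    PD 0:00 1 (Resources)'
printf '%s\n' "$out" | awk '$5 == "R" {print $1}'
```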
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  319  AllNodes sim100M. omangete  R    1:09:34      8 zm-node[004-011]

Get job details:
$ scontrol show job 106
JobId=106 Name=slurm-job.sh
   UserId=rstober(1001) GroupId=rstober(1001)
   Priority=4294901717 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
   StartTime=2013-01-26T12:55:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=atom-head1:3526
   ReqNodeList=(null) ExcNodeList=(null) NodeList=atom01 BatchHost=atom01
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/rstober/slurm/local/slurm-job.sh
   WorkDir=/home/rstober/slurm/local
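The scontrol output is a series of whitespace-separated Key=Value pairs, so individual fields are easy to pull out in a script (squeue -h -j 106 -o %T is an alternative for the state alone). A sketch, with the scontrol call replaced by a sample line mimicking the output above:

```shell
# On a cluster: line=$(scontrol show job 106)
line="JobId=106 Name=slurm-job.sh JobState=RUNNING Reason=None"
# Split the Key=Value pairs onto separate lines and pick out JobState.
state=$(printf '%s\n' "$line" | tr ' ' '\n' | sed -n 's/^JobState=//p')
echo "$state"
```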
Suspend a job (root only):
# scontrol suspend 135
# squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  135      defq simple.s  rstober  S       0:10      1 atom01
Resume a job (root only):
# scontrol resume 135
# squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  135      defq simple.s  rstober  R       0:13      1 atom01
Kill a job. Users can kill their own jobs; root can kill any job.

$ scancel 135
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
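scancel also accepts filters, which saves looping over IDs: scancel -u rstober cancels all of that user's jobs, and scancel -n simple cancels every job with that name. When you do have a list of IDs, a single scancel invocation takes them all; collecting them from saved squeue-style output might look like this (sample rows are illustrative):

```shell
# Pull the JOBID column from squeue-style output, then hand the whole
# list to one scancel call.
out='135 defq simple.s rstober R  0:10 1 atom01
136 defq simple.s rstober PD 0:00 1 (Resources)'
ids=$(printf '%s\n' "$out" | awk '{print $1}')
echo $ids
# then: scancel $ids
```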
Hold a job:
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  139      defq   simple  rstober PD       0:00      1 (Dependency)
  138      defq   simple  rstober  R       0:16      1 atom01
$ scontrol hold 139
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  139      defq   simple  rstober PD       0:00      1 (JobHeldUser)
  138      defq   simple  rstober  R       0:32      1 atom01
Release a job:
$ scontrol release 139
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  139      defq   simple  rstober PD       0:00      1 (Dependency)
  138      defq   simple  rstober  R       0:46      1 atom01
List partitions:
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1  down* atom04
defq*        up   infinite      3   idle atom[01-03]
cloud        up   infinite      2  down* cnode1,cnodegpu1
cloudtran    up   infinite      1   idle atom-head1
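sinfo can do much of this filtering itself (sinfo -h -t idle -o '%D' prints just the idle node counts), but the NODES column is also easy to total from saved output. A sketch using the rows from the example above:

```shell
# Sum the NODES column (4th field) for rows whose STATE (5th field) is "idle".
out='defq* up infinite 1 down* atom04
defq* up infinite 3 idle atom[01-03]
cloud up infinite 2 down* cnode1,cnodegpu1
cloudtran up infinite 1 idle atom-head1'
printf '%s\n' "$out" | awk '$5 == "idle" {n += $4} END {print n}'
```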
Submit a job that's dependent on a prerequisite job being completed:
Here's a simple job script. Note that the Slurm -J option is used to give the job a name.
#!/usr/bin/env bash
#SBATCH -p defq
#SBATCH -J simple
sleep 60
Submit the job:

$ sbatch simple.sh
Submitted batch job 149
Now we'll submit another job that's dependent on the previous job. There are many ways to specify the dependency conditions, but "singleton" is the simplest. The Slurm -d singleton argument tells Slurm not to dispatch this job until all previously submitted jobs with the same job name and user have completed.
$ sbatch -d singleton simple.sh
Submitted batch job 150
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  150      defq   simple  rstober PD       0:00      1 (Dependency)
  149      defq   simple  rstober  R       0:17      1 atom01
Once the prerequisite job finishes, the dependent job is dispatched.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  150      defq   simple  rstober  R       0:31      1 atom01
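Besides singleton, the -d (--dependency) option accepts conditions tied to a specific job ID, such as afterok:<jobid> (start only if that job finished successfully) or afterany:<jobid> (start once it has finished, however it exited). Combined with --parsable this lets a script chain jobs together; a sketch, with the job ID hard-coded in place of a real sbatch call and illustrative script names:

```shell
# On a cluster: first=$(sbatch --parsable step1.sh)
first=149
# Build the dependency specification for the follow-up job.
dep="afterok:$first"
echo "$dep"
# then: sbatch -d "$dep" step2.sh
```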