basic_slurm_usage

This article describes basic Slurm usage. Brief “how-to” topics include, in this order:

  A simple Slurm job script
  Submit the job
  List jobs
  Get job details
  Suspend a job (root only)
  Resume a job (root only)
  Kill a job
  Hold a job
  Release a job
  List partitions
  Submit a job that's dependent on a prerequisite job being completed

Simple Slurm job script:

$ cat slurm-job.sh

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set max wallclock time
#SBATCH --time=100:00:00

# set name of job
#SBATCH --job-name=MyTestjob5

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=user@zamren.com

# run the application
echo "In the directory: $(pwd)"
echo "As the user: $(whoami)"
echo "write this is a file" > analysis.output
sleep 60
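sbatch only reads #SBATCH lines that begin a line and come before the first executable command; a directive placed after a command, or on the same line as an ordinary comment, is treated as a plain comment. The sketch below lists the directives sbatch would actually see in a script. The function name list_sbatch_directives is a hypothetical helper, not part of Slurm.

```shell
# Print the #SBATCH directives that precede the first command in a script.
list_sbatch_directives() {
    awk '
        /^#SBATCH/         { print; next }  # a directive at the start of a line
        /^#/ || /^[ ]*$/   { next }         # shebang, ordinary comment, or blank line
                           { exit }         # first command: later directives are ignored
    ' "$1"
}

# Usage (file name taken from this article):
#   list_sbatch_directives slurm-job.sh
```

If an option is missing from the output, check that its #SBATCH line is not sharing a line with a comment.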

Submit the job:

$ sbatch slurm-job.sh

Submitted batch job 106
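Scripts that submit jobs usually need the job ID for later squeue, scontrol or scancel calls. A minimal sketch, assuming the output has the "Submitted batch job <id>" form shown above; job_id_from_sbatch is a hypothetical helper name.

```shell
# Read sbatch output on stdin and print only the job ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $NF }'
}

# On a cluster:  jobid=$(sbatch slurm-job.sh | job_id_from_sbatch)
# Demonstration with the output shown above:
echo "Submitted batch job 106" | job_id_from_sbatch   # prints 106
```

sbatch also has a --parsable option that prints just the job ID (followed by ";cluster" on multi-cluster setups), which avoids the text parsing altogether.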

List jobs:

$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  106      defq slurm-jo  rstober   R       0:04      1 atom01

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  319  AllNodes sim100M. omangete   R    1:09:34      8 zm-node[004-011]

Get job details:

$ scontrol show job 106

JobId=106 Name=slurm-job.sh
 UserId=rstober(1001) GroupId=rstober(1001)
 Priority=4294901717 Account=(null) QOS=normal
 JobState=RUNNING Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
 RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
 SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
 StartTime=2013-01-26T12:55:02 EndTime=Unknown
 PreemptTime=None SuspendTime=None SecsPreSuspend=0
 Partition=defq AllocNode:Sid=atom-head1:3526
 ReqNodeList=(null) ExcNodeList=(null)
 NodeList=atom01
 BatchHost=atom01
 NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
 MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
 Features=(null) Gres=(null) Reservation=(null)
 Shared=0 Contiguous=0 Licenses=(null) Network=(null)
 Command=/home/rstober/slurm/local/slurm-job.sh
 WorkDir=/home/rstober/slurm/local
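The scontrol output is a series of Key=Value pairs, so a single field can be pulled out with standard text tools. The sketch below is one way to do that; scontrol_field is a hypothetical helper name, and it assumes the value contains no spaces (true for the fields shown above, but not guaranteed for paths such as Command or WorkDir).

```shell
# Read `scontrol show job` output on stdin and print the value of one key.
# Assumption: the value contains no whitespace.
scontrol_field() {
    tr -s ' \t' '\n' |
        awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# On a cluster:  scontrol show job 106 | scontrol_field JobState
# Demonstration with two lines of the output shown above:
printf 'JobId=106 Name=slurm-job.sh\n JobState=RUNNING Reason=None Dependency=(null)\n' |
    scontrol_field JobState   # prints RUNNING
```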

Suspend a job (root only):

# scontrol suspend 135
# squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  135      defq simple.s  rstober   S       0:10      1 atom01

Resume a job (root only):

# scontrol resume 135
# squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  135      defq simple.s  rstober   R       0:13      1 atom01

Kill a job. Users can kill their own jobs; root can kill any job.

$ scancel 135
$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
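scancel can also select jobs by owner and state instead of by job ID, using its --user and --state options. The sketch below wraps that in a function with a dry-run switch so the command can be inspected before it is run. The function name cancel_pending and the DRY_RUN variable are assumptions made for this example, not Slurm features.

```shell
# Cancel all pending jobs of one user; with DRY_RUN set, only print the command.
cancel_pending() {
    user=${1:-$USER}
    if [ -n "${DRY_RUN:-}" ]; then
        echo "scancel --user=$user --state=PENDING"
    else
        scancel --user="$user" --state=PENDING
    fi
}

# Show what would be run for the user from the examples above:
DRY_RUN=1 cancel_pending rstober   # prints: scancel --user=rstober --state=PENDING
```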

Hold a job:

$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  139      defq   simple  rstober  PD       0:00      1 (Dependency)
  138      defq   simple  rstober   R       0:16      1 atom01

$ scontrol hold 139
$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  139      defq   simple  rstober  PD       0:00      1 (JobHeldUser)
  138      defq   simple  rstober   R       0:32      1 atom01

Release a job:

$ scontrol release 139
$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  139      defq   simple  rstober  PD       0:00      1 (Dependency)
  138      defq   simple  rstober   R       0:46      1 atom01
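In the listings above, the NODELIST(REASON) column shows why a pending job is waiting: (JobHeldUser) while it is held, (Dependency) once it is released. The sketch below filters squeue output down to pending jobs and their reasons. pending_reasons is a hypothetical helper name, and the code assumes the default eight-column layout shown above with no spaces in the job name.

```shell
# Read default-format squeue output on stdin; print "<jobid> <reason>" for PD jobs.
pending_reasons() {
    awk 'NR > 1 && $5 == "PD" {
        id = $1
        for (i = 1; i <= 7; i++) $i = ""   # drop the first seven columns
        sub(/^ +/, "")                     # what remains is the reason text
        print id, $0
    }'
}

# On a cluster:  squeue | pending_reasons
```

squeue can produce the same listing directly with a custom format, for example squeue -h -t PD -o "%i %r", where %i is the job ID and %r the reason.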

List partitions:

$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1  down* atom04
defq*        up   infinite      3   idle atom[01-03]
cloud        up   infinite      2  down* cnode1,cnodegpu1
cloudtran    up   infinite      1   idle atom-head1
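In sinfo output, the "*" after a partition name marks the default partition, and a "*" after a state marks nodes that are not responding. The sketch below totals the idle nodes per partition from that output. idle_nodes is a hypothetical helper name, and the code assumes the default six-column layout shown above.

```shell
# Read default-format sinfo output on stdin; print "<partition> <idle node count>".
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # strip the default-partition marker
             n[$1] += $4             # add this row to the partition total
         }
         END { for (p in n) print p, n[p] }' | sort
}

# On a cluster:  sinfo | idle_nodes
```

For the listing above this would report one idle node in cloudtran and three in defq.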

Submit a job that's dependent on a prerequisite job being completed:

Here's a simple job script. Note that the Slurm -J option is used to give the job a name.

#!/usr/bin/env bash

#SBATCH -p defq
#SBATCH -J simple

sleep 60

Submit the job:

$ sbatch simple.sh

Submitted batch job 149

Now we'll submit another job that's dependent on the previous job. There are many ways to specify the dependency conditions, but “singleton” is the simplest. The -d singleton argument tells Slurm not to dispatch this job until all previously submitted jobs with the same name and user have finished.

$ sbatch -d singleton simple.sh

Submitted batch job 150

$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  150      defq   simple  rstober  PD       0:00      1 (Dependency)
  149      defq   simple  rstober   R       0:17      1 atom01

Once the prerequisite job finishes, the dependent job is dispatched.

$ squeue

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  150      defq   simple  rstober   R       0:31      1 atom01
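Singleton relies on both jobs sharing a name. Another of sbatch's dependency conditions, afterok:<jobid>, ties a job to one specific prerequisite and starts it only if that job finished successfully. The sketch below chains two scripts that way. submit_chain is a hypothetical helper name; --parsable and --dependency are standard sbatch options.

```shell
# Submit script $1, then script $2 so that it starts only after $1 succeeds.
# Prints the job ID of the dependent job.
submit_chain() {
    chain_first=$(sbatch --parsable "$1") || return 1
    chain_first=${chain_first%%;*}       # --parsable may append ";cluster"
    chain_second=$(sbatch --parsable --dependency=afterok:"$chain_first" "$2") || return 1
    echo "${chain_second%%;*}"
}

# On a cluster (script name taken from this article):
#   submit_chain simple.sh simple.sh
```

If the prerequisite fails, an afterok dependency can never be satisfied, and the dependent job stays pending until it is cancelled or the scheduler removes it, depending on site configuration.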
basic_slurm_usage.txt · Last modified: 2016/04/21 13:17 by Editor