=====Getting Started Guide=====
  
This section shows how to log in to the system and submit a basic job on the cluster. If you do not have an account already, please apply for one by following the link [[applying_for_an_account|]].
  
  
**Submitting Jobs using TORQUE**
  
[[http://en.wikipedia.org/wiki/TORQUE|TORQUE]] is an open source batch queuing system that is very similar to [[http://en.wikipedia.org/wiki/Portable_Batch_System|PBS]]. Most PBS commands will work without any change. TORQUE is maintained by [[http://www.adaptivecomputing.com/products/open-source/torque/|Adaptive Computing]].

In order to use the HPC compute nodes, you must first log into the login nodes and submit a PBS job. The qsub command is used to submit a job to the PBS queue and to request additional resources. The qstat command is used to check on the status of a job already in the PBS queue. To simplify submitting a job, you can create a PBS script and use the qsub and qstat commands to interact with the PBS queue.
  
**Creating a PBS Script**
  
To set the parameters for your job, you can create a control file that contains the commands to be executed. Typically, this is in the form of a PBS script. This script is then submitted to PBS using the qsub command.
  
Here is a sample PBS file, named myjob.pbs, followed by an explanation of each line of the file.
  
<code>
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
cd /home/rcf-proj3/pv/test/
source /usr/usc/sas/setup.sh
sas my.sas
</code>
  
The first line in the file identifies which shell will be used for the job. In this example, bash is used.

The second line specifies the number of nodes and processors desired for this job. In this example, one node with two processors is being requested.

The third line in the PBS file states how much wall-clock time is being requested. In this example, 59 seconds of wall time have been requested.

The fourth line tells the HPC cluster to access the directory where the data is located for this job. In this example, the cluster is instructed to change to the /home/rcf-proj3/pv/test/ directory.

The fifth line tells the cluster which program you would like to use to analyze your data. In this example, the cluster sources the environment for SAS.

The sixth line tells the cluster to run the program. In this example, it runs SAS, specifying my.sas (in the current directory, /home/rcf-proj3/pv/test/, as defined in the previous line) as the argument.

To submit your job without requesting additional resources, issue the command

**qsub myjob.pbs**
  
If you have myjob.pbs set up as explained in the example above and you want to override the default options in the myjob.pbs file, you can use the -l parameter on the qsub command line to override the options specified in the file.
  
Below are some examples of these overrides.
  
**Requesting Additional Wall Time**
  
If you need to request more or less wall time after you have already created your PBS script, you can do this by using the qsub command.

In the example script above, we have requested 59 seconds of wall time. If you realize later that your job actually requires five minutes to complete, the command

**qsub -l walltime=0:05:00 myjob.pbs**

will ask PBS for a limit of five minutes of wall time. If your job does not finish within the specified time, it will be terminated.
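Wall time is specified as hours:minutes:seconds. If you want to double-check a limit before submitting, a small bash snippet (not part of PBS, just a convenience sketch) can convert such a string to total seconds:

<code>
#!/bin/bash
# Convert an h:mm:ss walltime string into total seconds.
walltime="0:05:00"
IFS=: read -r h m s <<< "$walltime"
# The 10# prefix forces base 10 so fields like "05" are not parsed as octal.
total=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
echo "$total"
</code>

For the five-minute request above, this prints 300.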
**Requesting Nodes and Processors**

You may also alter the number of nodes and processors requested for a job by using the qsub command. In the example script, we have requested one node with two processors, or one dual-processor node.

If you later decide that you need four HPC nodes for your job but you are going to use only one of the two processors on each node, then use the following command:

**qsub -l walltime=0:05:00,nodes=4 myjob.pbs**

If you want to use both processors on each HPC node, use the following command:

**qsub -l walltime=0:05:00,nodes=4:ppn=2 myjob.pbs**
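If you find yourself overriding the same options on every submission, you can instead update the #PBS directives at the top of the script, which is equivalent to passing them with -l (a sketch using the values from the commands above):

<code>
#!/bin/bash
#PBS -l nodes=4:ppn=2
#PBS -l walltime=0:05:00
</code>

Since -l options on the qsub command line override directives in the file, the script can hold sensible defaults.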
**Requesting a Specific Network**

To run your job on the InfiniBand network, add the IB feature to your PBS script:

**#PBS -l nodes=1:ppn=2:IB**

MPI jobs using OpenMPI 1.6.4 or later can run on the InfiniBand network.

NOTE: Only one network should be specified for each job. If no network is specified, the job will be scheduled to run on whichever network is available.
**Checking Job Status**

To check on the status of your job, use the qstat command. The command

**qstat -u [your username]**

will show you the current status of all your submitted jobs.
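The exact columns vary with the TORQUE version, but the output looks roughly like the following (the job ID, username, and queue name are made up; S is the job state, e.g. Q for queued, R for running, C for completed):

<code>
Job ID          Username  Queue  Jobname    S  Time
--------------  --------  -----  ---------  -  -----
12345.hpc-pbs   ttrojan   main   myjob.pbs  R  00:02
</code>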
More information can be obtained from the [[http://docs.adaptivecomputing.com/torque/4-2-10/help.htm|TORQUE documentation]].
  
  
getting_started_guide.1461162703.txt.gz · Last modified: 2016/04/20 14:31 by Editor