iCIS Intra Wiki
ClusterHelp
How to run a job on a Linux cluster partition X with Slurm
You can run jobs on the cluster with the Slurm cluster software.
Concepts
A node is a single machine in the cluster.
A cluster consists of a set of nodes.
A partition is a defined subset of nodes of the whole cluster.
Besides being a subset of nodes, a partition can also limit the resources a job may use.
A job is typically run in a partition of the cluster.
A job step is a (possibly parallel) task within a job.
Each partition only allows members of certain Unix groups to run jobs on it.
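To make the difference between a job and a job step concrete, here is a minimal sketch of a batch script with two job steps. The file name steps.sh is hypothetical, and the "cnczshort" partition is used only because this page describes it as open to all users for testing. Each srun call inside the script starts one job step.

```shell
# Write a small batch script; each srun line inside it is one job step.
cat > steps.sh <<'EOF'
#! /bin/bash
#SBATCH --partition=cnczshort
#SBATCH --ntasks=2
srun --ntasks=1 hostname   # job step 0: a single task
srun hostname              # job step 1: one task per allocated task (2 here)
EOF

# Check the script for syntax errors without running it.
bash -n steps.sh && echo "steps.sh: syntax ok"

# On a cluster login node (requires Slurm) you would submit it with:
# sbatch steps.sh
```

The whole script is one job; Slurm numbers the job steps within it starting from 0.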
Example
The nodes in each partition can be listed with the following command:
$ sinfo -as
PARTITION      AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
all            up     infinite   43/23/1/67      cn[00-03,05-06,08-25,30-34,36-37,39-42,45-48,50-54,58-59,69,71,73-74,81-82,85-88,90-95],micronode[1-5]
snn            up     infinite   0/4/0/4         cn[04,27-28,70]
rimlsfnwi      up     infinite   0/1/0/1         cn45
microbiol      up     infinite   5/0/0/5         micronode[1-5]
microbiolprio  up     infinite   5/0/0/5         micronode[1-5]
...
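The NODES(A/I/O/T) column packs four counts into one field: allocated, idle, other and total nodes. The following is a minimal sketch of how a script could pull the idle count out of that field. The function name idle_nodes is hypothetical, and the sample lines are copied from the listing above so the snippet runs without Slurm.

```shell
# Hypothetical helper: read `sinfo -as` output on stdin and print
# "<partition> <idle node count>" for every partition line.
idle_nodes() {
    # Field 4 is NODES(A/I/O/T); skip the header and any line where the
    # field does not split into exactly four parts (such as "...").
    awk 'NR > 1 && split($4, n, "/") == 4 { print $1, n[2] }'
}

# Sample lines copied from the listing above.
sinfo_sample='PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
snn up infinite 0/4/0/4 cn[04,27-28,70]
rimlsfnwi up infinite 0/1/0/1 cn45
microbiol up infinite 5/0/0/5 micronode[1-5]'

printf '%s\n' "$sinfo_sample" | idle_nodes
```

On the cluster you would pipe the real command into the function instead: sinfo -as | idle_nodes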
To list information about all partitions:
$ scontrol show -a partitions
PartitionName=all
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=micronode[1-5],cn00,cn[01-03],cn[05-06],cn[08-25],cn[30-34],cn[36-37],cn[39-42],cn[45-48],cn[50-54],cn[58-59],cn69,cn71,cn[73-74],cn[81-82],cn[85-88],cn[90-95]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=SUSPEND
   State=UP TotalCPUs=2072 TotalNodes=67 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=1024

PartitionName=snn
   AllowGroups=snn,mbccn AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=UNLIMITED DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cn04,cn[27-28],cn70
   PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=200 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
....
Look at "AllowGroups" to see who has access to a partition. Note that the partition "all" appears to allow everyone, but that is not the case: access is limited to a special "clusterall" Unix group, of which only people from C&CZ are members.
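With many partitions, picking the AllowGroups value out of the full listing by eye gets tedious. As a rough sketch, the fields could be extracted like this. The function name allow_groups is hypothetical, and the sample text is an abbreviated copy of the scontrol output shown above so the snippet runs without Slurm.

```shell
# Hypothetical helper: read `scontrol show -a partitions` output on stdin
# and print "<partition> <AllowGroups value>" for every partition.
allow_groups() {
    # Put every key=value pair on its own line, then remember the current
    # partition name and print it next to its AllowGroups value.
    tr -s ' \t' '\n\n' | awk -F= '
        $1 == "PartitionName" { part = $2 }
        $1 == "AllowGroups"   { print part, $2 }'
}

# Abbreviated sample: only the first lines of each record from above.
scontrol_sample='PartitionName=all
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL

PartitionName=snn
   AllowGroups=snn,mbccn AllowAccounts=ALL AllowQos=ALL'

printf '%s\n' "$scontrol_sample" | allow_groups
```

On the cluster the equivalent call would be: scontrol show -a partitions | allow_groups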
For education you can run jobs in the "csedu" partition of the cluster, which contains only the compute nodes cn47 and cn48. You have to be a member of the "csedu" Unix group to run jobs in this partition.
# the following command shows information about the "csedu" partition:
$ scontrol show -a partitions csedu
PartitionName=csedu
   AllowGroups=csedu AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cn[47-48]
   PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
Let's see what happens when the current user is not a member of the 'csedu' Unix group but tries to run a job in the 'csedu' partition:
$ groups | grep csedu # gives an empty result, because the current user is not a member of the csedu Unix group
$ cat hello.csedu.sh
#! /bin/bash
#SBATCH --partition=csedu
sleep 60
echo "Hello world!"
$ sbatch hello.csedu.sh
sbatch: error: Batch job submission failed: User's group not permitted to use this partition
If you want to run a job in the 'csedu' partition, ask the owner of that partition to grant you access. For the 'csedu' partition the owner can give you access by adding your user account to the 'csedu' Unix group.
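The group check can also be done before submitting, so the error never occurs. This is only a sketch: the function name group_allowed is hypothetical, and it treats AllowGroups=ALL as open to everyone, which is Slurm's meaning but, as noted earlier on this page, does not hold for the "all" partition on this cluster.

```shell
# Hypothetical helper: succeed if any of the user's groups appears in a
# partition's comma-separated AllowGroups value.
#   $1 = AllowGroups value, e.g. "snn,mbccn"
#   $2 = space-separated group names, e.g. the output of `id -Gn`
group_allowed() {
    ga_allowed=$1
    ga_mine=$2
    # Assumption: ALL means no group restriction.
    [ "$ga_allowed" = "ALL" ] && return 0
    for ga_group in $ga_mine; do
        # Wrap both sides in commas so "edu" does not match "csedu".
        case ",$ga_allowed," in
            *",$ga_group,"*) return 0 ;;
        esac
    done
    return 1
}

# Check the current user against the csedu partition's AllowGroups value.
if group_allowed "csedu" "$(id -Gn)"; then
    echo "csedu: group check passed"
else
    echo "csedu: not in an allowed group, ask the partition owner for access"
fi
```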
The partitions "cncz" and "cnczshort" are accessible to all users, but only for test purposes. We can therefore modify the script to use the "cnczshort" partition instead, and the submission will succeed. For more details on how to run and query this job, see the C&CZ wiki page about Slurm.
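The modified script could look like the sketch below. Only the --partition line differs from hello.csedu.sh. The file name hello.cnczshort.sh is an assumption that follows the naming used above.

```shell
# Write the modified batch script to a file.
cat > hello.cnczshort.sh <<'EOF'
#! /bin/bash
#SBATCH --partition=cnczshort
sleep 60
echo "Hello world!"
EOF

# Check the script for syntax errors without running it.
bash -n hello.cnczshort.sh && echo "hello.cnczshort.sh: syntax ok"

# On a cluster login node (requires Slurm) you would submit it with:
# sbatch hello.cnczshort.sh
```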
For more information, see:
- the Slurm Quickstart documentation, for a good introduction to Slurm
- the C&CZ wiki page about Slurm, for useful commands and tips
- Slurm documentation