iCIS Intra Wiki
ClusterHelp


How to run a job on Linux cluster partition X with Slurm

You can run jobs on the cluster with the Slurm cluster software.

Concepts

A node is a single machine in the cluster.
A cluster consists of a set of nodes.
A partition is a defined subset of the nodes of the whole cluster.
Besides being a subset of nodes, a partition can also limit a job's resources, for example through settings such as MaxTime or MaxMemPerCPU.
A job is typically run in one partition of the cluster.
A job step is a (possibly parallel) task within a job.
For each partition, only members of certain Unix groups are allowed to run jobs on it.
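
To make these concepts concrete, here is a minimal sketch of a batch script with two job steps. The file name steps.sh is hypothetical, and the partition "cnczshort" is used only because this page mentions it as open to all users for testing; replace it with a partition you have access to.

```shell
# Write a small batch script; the file name steps.sh is hypothetical.
cat > steps.sh <<'EOF'
#!/bin/bash
# Partition: the subset of nodes in which the job runs.
#SBATCH --partition=cnczshort
# Resources requested for the job as a whole.
#SBATCH --ntasks=2

# Each srun call below starts one job step inside the job's allocation.
srun --ntasks=1 hostname    # first step: a single task
srun --ntasks=2 hostname    # second step: two tasks in parallel
EOF

# Submitting the script creates the job (needs a Slurm login node):
#   sbatch steps.sh
```

Each srun line prints the name of the node(s) it ran on, which shows on which nodes of the partition the job was placed.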


Example

The nodes in each partition can be listed with the following command:

   $ sinfo -as
   PARTITION       AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
   all                up   infinite       43/23/1/67  cn[00-03,05-06,08-25,30-34,36-37,39-42,45-48,50-54,58-59,69,71,73-74,81-82,85-88,90-95],micronode[1-5]
   snn                up   infinite          0/4/0/4  cn[04,27-28,70]
   rimlsfnwi          up   infinite          0/1/0/1  cn45
   microbiol          up   infinite          5/0/0/5  micronode[1-5]
   microbiolprio      up   infinite          5/0/0/5  micronode[1-5]
   ...
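
The NODES(A/I/O/T) column gives the number of allocated, idle, other and total nodes. As a rough sketch, the helper below (idle_per_partition is a hypothetical name, not a Slurm command) reads the summary on standard input and prints the idle count per partition, assuming the five-column layout shown above.

```shell
# idle_per_partition: read `sinfo -as` style output on stdin and print
# "<partition> <idle nodes>" for every partition line.
idle_per_partition() {
    awk 'NR > 1 && $4 ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
        split($4, n, "/")   # n[1]=allocated n[2]=idle n[3]=other n[4]=total
        print $1, n[2]
    }'
}

# Usage on a login node:
#   sinfo -as | idle_per_partition
```

A partition with an idle count of 0, such as "microbiol" in the listing above, is fully occupied, so a new job there will probably have to wait.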

To list detailed information about all partitions:

   $ scontrol show -a partitions 
   PartitionName=all
      AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
      AllocNodes=ALL Default=NO QoS=N/A
      DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
      MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
      Nodes=micronode[1-5],cn00,cn[01-03],cn[05-06],cn[08-25],cn[30-34],cn[36-37],cn[39-42],cn[45-48],cn[50-54],cn[58-59],cn69,cn71,cn[73-74],cn[81-82],cn[85-88],cn[90-95]
      PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
      OverTimeLimit=NONE PreemptMode=SUSPEND
      State=UP TotalCPUs=2072 TotalNodes=67 SelectTypeParameters=NONE
      JobDefaults=(null)
      DefMemPerNode=UNLIMITED MaxMemPerCPU=1024
   PartitionName=snn
      AllowGroups=snn,mbccn AllowAccounts=ALL AllowQos=ALL
      AllocNodes=ALL Default=NO QoS=N/A
      DefaultTime=UNLIMITED DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
      MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
      Nodes=cn04,cn[27-28],cn70
      PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
      OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
      State=UP TotalCPUs=200 TotalNodes=4 SelectTypeParameters=NONE
      JobDefaults=(null)
      DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   ...
  
Look at "AllowGroups" to see who has access to a partition. Note that the partition "all" appears to allow everyone, but that is not the case: it is tied to a special "clusterall" Unix group of which only people from C&CZ are members.
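
Because the full listing is long, it can help to reduce it to one line per partition. The following is a sketch only: allow_groups is a hypothetical helper name, and it assumes the Key=Value layout of the scontrol output shown above.

```shell
# allow_groups: read `scontrol show -a partitions` style output on stdin and
# print "<partition> <AllowGroups value>" for each partition.
allow_groups() {
    awk '{
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")                       # kv[1]=key, kv[2]=value
            if (kv[1] == "PartitionName") name = kv[2]
            if (kv[1] == "AllowGroups")   print name, kv[2]
        }
    }'
}

# Usage on a login node:
#   scontrol show -a partitions | allow_groups
```

For the listing above this would print lines such as "snn snn,mbccn", meaning that members of either the snn or the mbccn group may use the "snn" partition.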

For educational purposes you can run jobs in the "csedu" partition of the cluster, which contains only the compute nodes cn47 and cn48. You have to be a member of the "csedu" Unix group to run jobs in this partition.

  # the following command shows information about the "csedu" partition:
  $ scontrol show -a partitions csedu
  PartitionName=csedu
  AllowGroups=csedu AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=NO QoS=N/A
  DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
  MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
  Nodes=cn[47-48]
  PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
  OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
  State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
  JobDefaults=(null)
  DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
  


Let's see what happens when the current user is not a member of the 'csedu' Unix group but tries to run a job in the 'csedu' partition:

  $ groups | grep csedu
  # gives an empty result, because the current user is not a member of the csedu unix group
  $ cat hello.csedu.sh
  #! /bin/bash
  #SBATCH --partition=csedu
  sleep 60
  echo "Hello world!" 
  $ sbatch hello.csedu.sh
  sbatch: error: Batch job submission failed: User's group not permitted to use this partition

If you want to run a job in the 'csedu' partition, ask the owner of that partition to grant you access. For the 'csedu' partition the owner can give you access by adding your user account to the 'csedu' Unix group.
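
The membership check can also be done before submitting. The function below is a minimal sketch with a hypothetical name (group_allowed); it compares a comma-separated AllowGroups value with the space-separated group list printed by `id -Gn`. It deliberately does not treat the value ALL specially, since on this cluster ALL does not mean that everyone has access.

```shell
# group_allowed ALLOWGROUPS USERGROUPS
#   ALLOWGROUPS: comma-separated list as shown by scontrol, e.g. "snn,mbccn"
#   USERGROUPS:  space-separated list as printed by `id -Gn`
# Returns 0 if the user is in at least one of the allowed groups, 1 otherwise.
group_allowed() {
    for allowed_group in $(printf '%s' "$1" | tr ',' ' '); do
        for user_group in $2; do
            [ "$allowed_group" = "$user_group" ] && return 0
        done
    done
    return 1
}

# Usage:
#   group_allowed csedu "$(id -Gn)" || echo "ask the partition owner for access"
```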

The partitions "cncz" and "cnczshort" are accessible to all users, although only for test purposes. So we can modify the script to use the "cnczshort" partition instead, and the submission will be successful. For more details on how to run and query this job, see the C&CZ wiki page about Slurm.
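
The modified script could look as follows. Only the partition line differs from hello.csedu.sh; the file name hello.cnczshort.sh is an assumption made for this example.

```shell
# Create the modified batch script (hypothetical file name).
cat > hello.cnczshort.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=cnczshort
sleep 60
echo "Hello world!"
EOF

# Submit it on a login node:
#   sbatch hello.cnczshort.sh
```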


More info see