iCIS Intra Wiki

ClusterHelp

Revision as of 07:54, 12 May 2025 by Harcok

How to run a job on a Linux cluster partition X with Slurm

You can run jobs on the cluster using the Slurm workload manager.

Concepts

A node is a single machine in the cluster.
A cluster consists of a set of nodes.
A partition is a defined subset of the cluster's nodes.
Besides selecting a subset of nodes, a partition can also limit a job's resources, such as its run time or memory.
A job is typically run in a partition of the cluster.
A job step is a (possibly parallel) task within a job.
Each partition may allow only members of certain Unix groups to run jobs in it.
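The following minimal sketch shows how these concepts map onto a batch script. It assumes the "cnczshort" partition mentioned further down this page. The task count and time limit are illustrative values only, not site defaults. The whole script is one job submitted to one partition, and each `srun` line inside it starts a separate job step.

```shell
#!/bin/bash
# The whole script is one job; Slurm runs it on node(s) of the chosen partition.
# Partition choice, task count and time limit below are illustrative assumptions.
#SBATCH --partition=cnczshort
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# Each srun call starts a new job step inside this job.
# With --ntasks=2, each step launches two tasks that run in parallel.
srun hostname
srun sleep 10
```

Submitting this script with `sbatch` would create one job with two job steps. A partition's own limits (for example its MaxTime) still apply on top of what the script requests.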


Example

For example, for education you can run jobs in the "csedu" partition of the cluster, which contains only the compute nodes cn47 and cn48. You must be a member of the "csedu" Unix group to run jobs in this partition.

  # the following command shows information about the "csedu" partition:
  $ scontrol show -a partitions csedu
  PartitionName=csedu
  AllowGroups=csedu AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=NO QoS=N/A
  DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
  MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
  Nodes=cn[47-48]
  PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
  OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
  State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
  JobDefaults=(null)
  DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
  
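The partition description above is a list of space-separated `Key=Value` pairs. To pick out a single value (for instance the allowed groups or the node list), you can filter it with standard text tools. Below is a sketch of a hypothetical helper, `partition_field`, applied to an excerpt of the csedu output shown above. On the cluster you would pipe the real command into it instead, e.g. `scontrol show -a partitions csedu | partition_field AllowGroups`.

```shell
#!/bin/bash
# partition_field FIELD
# Hypothetical helper: reads `scontrol show partition` style output
# (space-separated Key=Value pairs) on stdin and prints the value of FIELD.
partition_field() {
  local field="$1"
  # Put every Key=Value pair on its own line, keep the first pair whose key
  # is exactly FIELD (the ^ anchor keeps "Nodes" from matching "MaxNodes"),
  # then strip the "FIELD=" prefix.
  tr ' ' '\n' | grep -m 1 "^${field}=" | cut -d '=' -f 2-
}

# Sample input: an excerpt of the csedu partition output shown above.
sample='PartitionName=csedu
   AllowGroups=csedu AllowAccounts=ALL AllowQos=ALL
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0
   Nodes=cn[47-48]
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED'

echo "$sample" | partition_field AllowGroups   # prints: csedu
echo "$sample" | partition_field Nodes         # prints: cn[47-48]
```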

Let's see what happens when the current user is not a member of the 'csedu' Unix group but tries to run a job in the 'csedu' partition:

  $ groups | grep csedu
  # empty output: the current user is not a member of the csedu Unix group
  $ cat hello.csedu.sh
  #! /bin/bash
  #SBATCH --partition=csedu
  sleep 60
  echo "Hello world!" 
  $ sbatch hello.csedu.sh
  sbatch: error: Batch job submission failed: User's group not permitted to use this partition
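Note that `groups | grep csedu` would also match any group whose name merely contains "csedu" (for instance a hypothetical "csedu-staff"). The sketch below is a slightly stricter check. It uses a hypothetical helper, `is_member_of`, which lists the user's groups with `id -nG` and compares whole group names only.

```shell
#!/bin/bash
# is_member_of GROUP [USER]
# Hypothetical helper: succeeds if USER (default: the current user) is a
# member of the Unix group GROUP, comparing whole group names only.
is_member_of() {
  local group="$1" user="${2:-$(id -un)}"
  # One group name per line; -x matches the whole line, -F disables regex.
  id -nG "$user" | tr ' ' '\n' | grep -qxF -- "$group"
}

if is_member_of csedu; then
  echo "member of csedu: jobs in the csedu partition should be accepted"
else
  echo "not a member of csedu: ask the partition owner for access"
fi
```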

If you want to run a job in the 'csedu' partition, ask the owner of that partition to grant you access. For the 'csedu' partition, the owner can give you access by adding your user account to the 'csedu' Unix group.

The partitions "cncz" and "cnczshort" are accessible to all users, but only for test purposes. So we can modify the script to use the "cnczshort" partition instead, and the submission will succeed. For more details on how to run and query this job, see the C&CZ wiki page about Slurm.
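Below is a sketch of the modified script. It assumes that only the partition line changes, and the filename `hello.cnczshort.sh` is hypothetical.

```shell
#!/bin/bash
# hello.cnczshort.sh (hypothetical name): same job as hello.csedu.sh, but
# submitted to the cnczshort partition, which all users may use for tests.
#SBATCH --partition=cnczshort
sleep 60
echo "Hello world!"
```

Submit it with `sbatch hello.cnczshort.sh`, which reports the new job ID ("Submitted batch job <jobid>"). While the job is pending or running, `squeue -u "$USER"` lists it. By default, the job's output ("Hello world!") is written to `slurm-<jobid>.out` in the directory you submitted from.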


For more info, see