iCIS Intra Wiki

Cluster


Users with experience using the previous version of the cluster (running Slurm version 20.05) are advised to look at the changes with respect to the previous configuration.

Overview

The Science Cluster is managed by C&CZ. It consists of compute nodes (i.e. servers) that are controlled by the Slurm Job Scheduling software. Slurm keeps track of the various compute resources (such as CPU cores, Memory, and GPUs), and allocates these to jobs according to their requirements.

In typical usage, you first log in to the login node of the Slurm 22 cluster, cnlogin22.science.ru.nl. You specify the resources required to run your program (e.g. 4 CPU cores, 10GB RAM, and 1 GPU, for at most 2 hours), along with some administrative information, in a text file called a batch script. Then you submit this batch script to Slurm, which will place your job into a queue and run it as soon as the required resources are available. When your job is completed, the results are saved to one or more files.

The cluster is primarily designed to run these kinds of non-interactive (or "batch") jobs. New users sometimes want to use a more interactive workflow, similar to what they are used to on a personal laptop. For example, they might wish to use a notebook-style interface such as Jupyter or RStudio, or a REPL. This is unfortunately not something the cluster is very well suited to. Interactive jobs cause the overall utilization of the cluster to be lower (because time spent waiting for user input is wasted), and it is inconvenient that you cannot accurately predict when your interactive session will start. Therefore, we strongly encourage users to try to adapt their code to run non-interactively, taking all inputs from a file.

The following sections describe the use of the cluster in more detail.

Access to Cluster Resources

Users from the department of Computer Science can make use of several of the compute nodes in the cluster, most of them with GPUs. These nodes belong to different sub-organizations: the Education Institute ("csedu") has nodes for use by students and teachers, and the Data Science section ("das") and the Institute for Computing and Information Sciences as a whole ("icis") have nodes for their respective employees. The total amount of resources available to each organization is:

Cluster Resources

Organization  Nodes  CPUs  CPU Types             Memory [GB]  GPUs  GPU Types
csedu         2      96    Xeon 4214             250          16    rtx_2080ti:16 (11GB)
das           3      224   Xeon 4214, Epyc 7452  941          24    rtx_2080ti:8 (11GB), rtx_3090:8 (24GB), rtx_a5000:6 (24GB), rtx_a6000:2 (48GB)
icis          2      384   Epyc 7642             1006         16    rtx_a5000:10 (24GB), rtx_a6000:6 (48GB)

Note: the cluster also contains a few older, non-GPU nodes; those resources are not included in the table above.

In order to use these resources, you must be a member of a Slurm Account that grants you access. You must always specify an account when submitting a job. The accounts for the "das" and "icis" nodes are just called das and icis. For the "csedu" nodes, there are separate accounts for courses (e.g. cseduimc030, ...) and for students working on a BSc/MSc thesis project (cseduproject). Note that you may be a member of multiple Slurm Accounts, e.g. a DaS employee with access to both the "das" and "icis" nodes, or a student participating in a course and working on a thesis project. You can view your current account membership with the command sshare -U. Please submit your jobs under the account that is most appropriate for the work you are doing.
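If you only want the account names from that listing, a small helper along the following lines may be handy. This is a sketch, not part of the cluster setup: the function name my_accounts is made up for illustration, and it assumes that sshare on the login node accepts the standard -n (no header), -P (parsable) and -o (field list) options.

```shell
# Print the Slurm Accounts you are a member of, one per line.
# Assumption: sshare supports -n (no header), -P (parsable) and -o Account.
my_accounts() {
    # Strip any indentation sshare adds and remove duplicate names.
    sshare -U -n -P -o Account | sed 's/^ *//; s/ *$//' | sort -u
}
```

Run my_accounts on the login node after defining the function; each line of output can be used as the value for -A / --account.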

To request membership of a Slurm Account, send an email to the scientific programmer for your section, or to Kasper Brink (kbrink@cs.ru.nl). Students should include their Science-username and the expected end date of their project, and CC their supervisor. Note that it will take a bit over an hour for account changes to become active on the cluster.

Job Classes

Motivation

Previously, all jobs on the cluster were treated the same, regardless of their size, expected runtime, and other jobs belonging to the same user. This led to some undesirable behaviour:

  • Long-running jobs would cause low responsiveness of the cluster: newly submitted jobs, even short, high-priority ones, would have to wait for a very long time for existing jobs to finish.
  • A single user could hog all of the resources by running a large number of simultaneous jobs.

This was usually resolved informally by contacting the users involved, but ideally we want to configure the Slurm scheduler to minimize such issues. Starting with Slurm version 22.05 we distinguish between different classes of jobs. Each class has specific limits on job size and run time, and the "smaller" classes generally have more resources available to them, and/or a higher scheduling priority.

Job Class Specifications

For each of the organizations "csedu", "das", and "icis", there are five Job Classes, shown in the tables below. In general you should pick the "smallest" class that will fit your job, because that will allow the job to be scheduled the soonest.

These job classes don't translate neatly to Slurm concepts (we tried!), so they are implemented through a combination of Slurm Partitions (e.g. csedu and csedu-prio) and QOSs (Quality of Service-specifications, e.g. csedu-small or csedu-preempt). You can copy the specifications for your job from the last two columns of the table.

Organization "csedu"

                   Max per job         Max for all jobs in class                                            Specify:
Job Class          CPU  Mem [GB]  GPU  CPU        Mem [GB]   GPU                     Time [h]  Preemptible  --partition=      --qos=
small              6    15        1    -          -          -                       4         N            csedu-prio,csedu  csedu-small
small preemptible  6    15        1    -          -          -                       -         Y            csedu-prio,csedu  csedu-preempt
normal             12   31        2    -          -          -                       12        N            csedu             csedu-normal
large              -    -         -    32 (33%)   83 (33%)   gpu:5 (31%)             48        N            csedu             csedu-large
preemptible        -    -         -    -          -          -                       -         Y            csedu             csedu-preempt

Organization "das"

                   Max per job         Max for all jobs in class                                            Specify:
Job Class          CPU  Mem [GB]  GPU  CPU        Mem [GB]   GPU                     Time [h]  Preemptible  --partition=      --qos=
small              6    23        1    -          -          -                       4         N            das-prio,das      das-small
small preemptible  6    23        1    -          -          -                       -         Y            das-prio,das      das-preempt
normal             12   46        2    -          -          -                       12        N            das               das-normal
large              -    -         -    74 (33%)   313 (33%)  gpu:8, gpu:rtx_a6000:1  48        N            das               das-large
preemptible        -    -         -    -          -          -                       -         Y            das               das-preempt

Organization "icis"

                   Max per job         Max for all jobs in class                                            Specify:
Job Class          CPU  Mem [GB]  GPU  CPU        Mem [GB]   GPU                     Time [h]  Preemptible  --partition=      --qos=
small              24   62        1    -          -          -                       4         N            icis-prio,icis    icis-small
small preemptible  24   62        1    -          -          -                       -         Y            icis-prio,icis    icis-preempt
normal             48   125       2    -          -          -                       12        N            icis              icis-normal
large              -    -         -    128 (33%)  335 (33%)  gpu:5, gpu:rtx_a6000:2  48        N            icis              icis-large
preemptible        -    -         -    -          -          -                       -         Y            icis              icis-preempt

Notes:

  • These limits may change in the future as we gain more experience with the scheduling behaviour.
  • Jobs in the "small" and "small preemptible" classes can be submitted to both the <ORG>-prio and <ORG> partitions. At most one job per user will be run with increased priority at any time, but if the cluster is lightly loaded, additional jobs may be able to run at standard priority.
  • Jobs should only be run in one of the "preemptible" classes if your code is written so that it correctly handles being terminated and restarted at any time (see below).
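
One common way to make a job safe to terminate and restart is to record progress after every unit of work and to skip finished work when the script starts again. The sketch below illustrates that idea only; it does not rely on how Slurm signals or requeues preempted jobs on this cluster, which is not described here. The names STATE_FILE, TOTAL_STEPS, do_step and run_steps are hypothetical.

```shell
# Resumable work loop: the number of the last finished step is kept in a
# progress file, so a restarted job continues where the previous run stopped.
# STATE_FILE, TOTAL_STEPS and do_step are hypothetical names for illustration.
STATE_FILE=${STATE_FILE:-progress.txt}
TOTAL_STEPS=${TOTAL_STEPS:-5}

do_step() {
    # Placeholder for one unit of real work (e.g. one training epoch).
    echo "running step $1"
}

run_steps() {
    last_done=0
    # Pick up the step counter left behind by an earlier run, if any.
    if [ -f "$STATE_FILE" ]; then
        last_done=$(cat "$STATE_FILE")
    fi
    step=$((last_done + 1))
    while [ "$step" -le "$TOTAL_STEPS" ]; do
        do_step "$step" || return 1
        # Write to a temporary file first and then rename it, so that being
        # killed during the write cannot leave a half-written progress file.
        echo "$step" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
        step=$((step + 1))
    done
}
```

In a batch script you would call run_steps as the last command. Real programs usually also need to save their own state (model weights, partial results) at each step, not just the step number.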

Running Jobs

Log in to the Login Node

To run jobs on the cluster, first log in to the login node cnlogin22.science.ru.nl, using ssh (for Windows users, PuTTY is a popular ssh client). Note that cnlogin22 is actually an alias that points to one of the general-purpose cluster nodes that is suitable for interactive use. The login node is accessible from the RU campus network, and from lilo.science.ru.nl. It is not directly accessible from outside the campus; from there you can reach it via one of the following methods:

  • Through a VPN connection.
  • By using lilo.science.ru.nl as an ssh Jump Host, with ssh -J lilo.science.ru.nl cnlogin22 (PuTTY users can configure this in the Proxy panel).

All interaction with Slurm takes place on the login node. It is normally not necessary to log in directly to the compute nodes themselves (and in fact this may be disabled in the future). Once logged in, the command sinfo gives a high-level overview of the state of all the partitions in the cluster.

If you are unable to log in to the login node, log in to lilo.science.ru.nl and run the command id; check that the output includes at least one of the cluster-related unix groups (cluster{ds,das,icis,icisonly} or csedu*).
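
The group check can also be scripted. The helper below is a small sketch (the function name has_cluster_group is made up) that takes a list of group names and succeeds if one of them matches the cluster-related groups listed above.

```shell
# Succeeds if the given list of unix group names contains a cluster-related
# group: clusterds, clusterdas, clustericis, clustericisonly, or csedu*.
has_cluster_group() {
    for group in "$@"; do
        case "$group" in
            clusterds|clusterdas|clustericis|clustericisonly|csedu*) return 0 ;;
        esac
    done
    return 1
}
```

On lilo you could call it as has_cluster_group $(id -Gn) && echo "cluster group found", where id -Gn prints the names of all groups you belong to.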

Determine Account and Resource Limits

Run sshare -U. If multiple accounts are listed, choose the one that is most appropriate to the job you wish to run. If the command doesn't show any accounts while you are logged in to the login node, please wait for another hour before contacting support.

Next, determine the number of CPU cores and the amount of Memory your job requires. If you are unsure, you can use the resource limits of the Job Classes as a guide (e.g. 6 cores and 15GB RAM for the "csedu" nodes). However, you should try not to request more resources than your job will actually use (e.g. if you know your job needs at most 2GB of RAM, use that as the limit). This may make it possible for your job to run sooner, and it will also allow Slurm to run more jobs alongside it.
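
If you have run a similar job before, its accounting record can help you choose realistic values. The following sketch assumes that job accounting is enabled on the cluster and that the standard sacct command is available on the login node; neither is stated on this page. The function name job_usage is hypothetical.

```shell
# Show run time, peak memory and requested resources of a finished job.
# Assumption: Slurm job accounting is enabled, so sacct has data to report.
job_usage() {
    # $1 is the job id that sbatch reported when the job was submitted.
    sacct -j "$1" --format=JobID,Elapsed,MaxRSS,ReqMem,AllocCPUS,State
}
```

Comparing MaxRSS with ReqMem, and Elapsed with the time limit you set, shows how much headroom the earlier request had.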

This is also important when it comes to the Time Limit (the maximum run time) of your job. On the one hand, you don't want to set this too short, because your job will be terminated (probably without producing results) when the maximum run time is reached. On the other hand, setting a lower time limit on your job may allow Slurm to run it sooner. This is because the scheduler uses a process called Backfilling, whereby it specifically looks for small and short jobs to fill in "holes" in its schedule, moving them ahead of larger jobs that may have a higher priority.

If your job requires one or more GPUs, these can be requested with a GRES (Generic Resources) specification. To request an arbitrary GPU where the type doesn't matter, use gpu:N (where N=1,...,8 is the number of GPUs). If you require a specific GPU type (usually due to memory size), you can specify this as e.g. gpu:rtx_a5000:N. See the Cluster Resources table for the available GPU types.
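
As a small illustration of the two forms, the helper below builds a GRES specification from a GPU count and an optional GPU type. The function name gres_spec is made up for this example.

```shell
# Build a GRES specification: gpu:N, or gpu:TYPE:N when a GPU type is given.
gres_spec() {
    count=$1
    type=${2:-}
    if [ -n "$type" ]; then
        echo "gpu:$type:$count"
    else
        echo "gpu:$count"
    fi
}
```

For example, gres_spec 1 prints gpu:1 and gres_spec 2 rtx_a5000 prints gpu:rtx_a5000:2; either string can be passed to --gres=.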

Finally, determine the smallest Job Class that will fit your job from the tables above, and make a note of the Partition and QOS specifications. You should now have values for the following job parameters: Account, Partition, QOS, Cores, Memory, (optionally) GRES, and the Time Limit.
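
The lookup for the "csedu" organization could be sketched as follows. The limits are copied from the "csedu" table above; treating the class-wide totals of the "large" class as a ceiling for a single job is our assumption, and the function name csedu_job_class is hypothetical. Since the limits may change, treat the table as authoritative.

```shell
# Suggest the smallest "csedu" Job Class for a non-preemptible job.
# Arguments: CPU cores, memory in GB, number of GPUs, run time in whole hours.
# Prints "<partition> <qos>".
# Assumption: the class-wide totals of "large" are used as a per-job ceiling.
csedu_job_class() {
    cpus=$1; mem=$2; gpus=$3; hours=$4
    if [ "$cpus" -le 6 ] && [ "$mem" -le 15 ] && [ "$gpus" -le 1 ] && [ "$hours" -le 4 ]; then
        echo "csedu-prio,csedu csedu-small"
    elif [ "$cpus" -le 12 ] && [ "$mem" -le 31 ] && [ "$gpus" -le 2 ] && [ "$hours" -le 12 ]; then
        echo "csedu csedu-normal"
    elif [ "$cpus" -le 32 ] && [ "$mem" -le 83 ] && [ "$gpus" -le 5 ] && [ "$hours" -le 48 ]; then
        echo "csedu csedu-large"
    else
        # Nothing else fits; only the preemptible class has no listed limits.
        echo "csedu csedu-preempt"
    fi
}
```

For instance, csedu_job_class 4 10 1 2 prints csedu-prio,csedu csedu-small. Keep in mind that the preemptible fallback is only suitable for jobs that can be terminated and restarted.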

Running an Interactive Job

To run an interactive job on the cluster enter the following command on the login node:

srun -A account -p partitions -q qos -c cores --mem memory [--gres=gresspec] -t hh:mm:ss --pty bash

Slurm will wait until the requested resources are available, and then run a shell on one of the cluster nodes from which you can run further commands. When the time limit is reached the shell will be terminated.
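
A filled-in version of that command might look like the sketch below. The account, core count, memory size and time limit are example values only (chosen to fit the "small" class of the "csedu" nodes); substitute your own. Wrapping the command in a function named interactive_shell is just a convenience for this example.

```shell
# Start an interactive shell in the "small" class on the "csedu" nodes.
# All values are examples: 2 cores, 4GB RAM, 1 GPU, at most 1 hour.
interactive_shell() {
    srun -A cseduproject -p csedu-prio,csedu -q csedu-small \
         -c 2 --mem 4G --gres=gpu:1 -t 1:00:00 --pty bash
}
```

Calling interactive_shell on the login node behaves exactly like typing the srun command directly.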

Running a Batch Job

As mentioned in the overview, it's usually more convenient to run a job non-interactively. For this you can put all the job parameters in a Batch Script. For example, create a file myjob.sh with the following contents, adapted to fit your job:

#!/bin/bash
#SBATCH --account=cseduimc030
#SBATCH --partition=csedu-prio,csedu
#SBATCH --qos=csedu-small
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err

# Commands to run your program go here, e.g.:
python myjob.py

You can then submit your job to the queue with the command

sbatch myjob.sh
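
After submitting, you will typically want to keep track of the job. The helpers below are a sketch built on the standard Slurm commands sbatch --parsable, squeue and scancel; the function names are made up for illustration.

```shell
# Submit a batch script and print only its job id.
# --parsable makes sbatch print "jobid" or "jobid;cluster" instead of a sentence.
submit_job() {
    sbatch --parsable "$1" | cut -d ';' -f 1
}

# Show the state of all your own jobs in the queue.
my_jobs() {
    squeue -u "$USER"
}

# Cancel a job by its job id.
cancel_job() {
    scancel "$1"
}
```

With the example script above, the output and error messages of job 12345 would end up in myjob-12345.out and myjob-12345.err, because %j in the file name is replaced by the job id.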

Changes w.r.t. the Previous Cluster Configuration

Users with experience using the previous version of the cluster (running Slurm version 20.05) should be aware of the following changes with respect to the old configuration:

  • The login node of the cluster is now reachable via the alias cnlogin22.science.ru.nl (previously this was the fixed node cn99).
  • Specifying an Account (with -A acct or --account=acct) is now mandatory for all jobs (previously this was only recommended).
  • We now distinguish between different classes of jobs, based on job size and maximum runtime. These classes differ in the amount of resources available to them in total, and in their scheduling priority. They are implemented through a combination of Partition and QOS specifications; see the Job Class Specifications for details (previously, all jobs were treated the same).
  • The default Job Class has more restrictive limits than previously was the case. For example, a job submitted with just sbatch -A cseduproject -p csedu will run in the "normal" class, with maximum resource limits of 12 cores / 31GB RAM / 2 GPUs / 12 hours (previously unrestricted).