
Servers


Servers C&CZ

  • linux login servers
  • C&CZ administers the linux cluster nodes for the departments in the beta faculty

ICIS servers within the C&CZ linux cluster

C&CZ maintains the cluster.

About Linux clusters

Info about linux cluster nodes

Every compute node/server has a local /scratch partition/volume. You can use it to store big, temporary data.

Policy

  • Use the mail alias users-of-icis-servers@science.ru.nl to give notice when you need a machine for a certain time slot (e.g. an article deadline, or benchmarks that must run without interference from other processes).
  • Please create a directory named after your username in the scratch directories, so the disk is not polluted with loose files and ownership is clear (see the sketch below).
  • Please be considerate to other users: keep the local directories cleaned up and kill processes you no longer need.
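
A minimal shell sketch of this workflow; the /scratch location comes from the section above, the file names are only placeholders:

  $ mkdir -p /scratch/$USER             # your own sub-directory, ownership is clear from the name
  $ cp ~/bigdata.tar.gz /scratch/$USER  # placeholder input file, staged to the fast local disk
  $ cd /scratch/$USER && tar xzf bigdata.tar.gz
  # ... run your computation here ...
  $ cp results.tar.gz ~/                # copy results back to your (backed-up) home directory
  $ rm -rf /scratch/$USER/*             # leave the local disk clean for other users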

Access and Usage permission

For servers within the linux cluster managed by C&CZ you must first be granted access to one of the domain groups cluster-is (for DaS), cluster-ds (for DiS), or cluster-mbsd (for SwS), depending on your section, before you can log in to these machines. OII has one domain group, cluster-csedu. Each domain group gives access to the full cluster; the groups do not differ in access and only exist for administrative reasons.

Because access is granted to the whole cluster (cn00-cn96.science.ru.nl), you can log in to every machine in the cluster. However, you should only run jobs directly on the machines you have been granted usage of, so ask the owner of a machine for permission before using it.

When using the Slurm cluster software you can only run jobs on a partition of cluster machines if you are a member of a unix group that is allowed to use that partition. With Slurm, usage is therefore controlled by membership of these unix groups.
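
For example, you can check your own unix groups and which groups each partition allows with standard commands on a cluster login node (a quick sketch, not specific to any partition):

  $ groups                              # unix groups of the current user
  $ scontrol show partition | grep -E 'PartitionName|AllowGroups'
  # lists for every partition which unix groups may use it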

To get access to a domain group, contact the Support_Staff, who can arrange this with C&CZ. These domain groups can only be viewed and edited by C&CZ.

When a person is added to one of the domain groups, he/she is also added to the mailing list users-of-icis-servers@science.ru.nl, which is used for the policy described in the previous section. Once added to this mailing list you can view its contents on dhz.science.ru.nl.

How to run a job on a linux cluster partition X with Slurm

You can run jobs on the cluster with the Slurm cluster software.

  • A node is a single machine in the cluster.
  • A cluster consists of a set of nodes.
  • A partition is a defined subset of the nodes of the whole cluster; besides being a subset, a partition can also limit a job's resources.
  • A job is typically run in a partition of the cluster.
  • A job step is a (possibly parallel) task within a job.
  • Per partition, only members of certain unix groups are allowed to run jobs.

E.g. for education you could run jobs in the "csedu" partition of the cluster, which only contains the compute nodes cn47 and cn48. You have to be a member of the "csedu" unix group to run jobs in this partition.

  # the following command shows information about the "csedu" partition:
  $ scontrol show -a partitions csedu
  PartitionName=csedu
  AllowGroups=csedu AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=NO QoS=N/A
  DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
  MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
  Nodes=cn[47-48]
  PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
  OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
  State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
  JobDefaults=(null)
  DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
  

Let's see what happens when the current user is not a member of the 'csedu' unix group but tries to run a job in the 'csedu' partition:

  $ groups |grep csedu    
  # empty output: the current user is not a member of the csedu unix group
  $ cat hello.csedu.sh
  #! /bin/bash
  #SBATCH --partition=csedu
  sleep 60
  echo "Hello world!" 
  $ sbatch hello.csedu.sh
  sbatch: error: Batch job submission failed: User's group not permitted to use this partition

If you want to run a job in the 'csedu' partition, ask the owner of that partition to grant you access. For the 'csedu' partition the owner can grant access by adding your user account to the 'csedu' unix group.

The partitions "cncz" and "cnczshort" are accessible for all users, although only for test purposes. So we can modify the script to use the "cnczshort" partition instead and the execution will be successfull. For more details how to run and query this job see the C&CZ wiki page about Slurm.


For more info see the C&CZ Slurm documentation: https://cncz.science.ru.nl/en/howto/slurm/

Administrator details

  • access can be granted by contacting one of the scientific programmers
  • access is immediately granted to the whole cluster (cn00-cn96.science.ru.nl); however, you are only allowed to use the machines you have been granted usage of.
  • there are two email addresses for the ICIS cluster node machines:


Old info which recently changed (for the new setup, see the new Cluster page):

  • access to the cluster is controlled by domain groups:
    • iCIS has three domain groups, one per section, to grant people access to the cluster: cluster-is(for DaS), cluster-ds(for DiS), or cluster-mbsd(for SwS)
    • OII has one domain group cluster-csedu

Overview servers

Overview servers for education

contact Kasper Brink

 cn47.science.ru.nl	OII node (Supermicro, 2 x Intel Xeon Silver 4214 2.2 GHz, 128 GB, 8x GPU)
 cn48.science.ru.nl	OII node (Supermicro, 2 x Intel Xeon Silver 4214 2.2 GHz, 128 GB, 8x GPU)

OII has one domain group, cluster-csedu. Add students to this domain group to give them access to the cluster. Usage of the Slurm partition 'csedu' is only allowed for members of the 'csedu' unix group (see the sketch below).
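
Because the csedu nodes each have 8 GPUs, a typical job will request one; a minimal sketch, assuming NVIDIA GPUs configured as a Slurm GRES on this partition and using a hypothetical script name:

  $ cat gpu-job.sh
  #! /bin/bash
  #SBATCH --partition=csedu
  #SBATCH --gres=gpu:1                  # request a single GPU (assumes the 'gpu' GRES is configured)
  nvidia-smi                            # show the GPU assigned to this job
  $ sbatch gpu-job.sh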

Overview servers per section

  • SwS - contact Harco Kuppens
  
 Cluster: 
    
  We have limited access to the clusters 
    
    * slurm20 with login node cnlogin20.science.ru.nl and 
    * slurm22 with login node cnlogin22.science.ru.nl
       
    For info about the clusters and the Slurm software used to schedule jobs,
    see https://cncz.science.ru.nl/en/howto/slurm/, which also contains a Slurm starter tutorial.
   
 * nodes for research
    
    Both clusters have an ICIS partition that contains machines that belong to ICIS and that we may use.
    The ICIS partition on slurm20 contains only the machine cn89.science.ru.nl (dagobert.science.ru.nl), and
    the ICIS partition on slurm22 contains cn114.science.ru.nl and cn115.science.ru.nl.
     
      icis partition on slurm20:
        cn89.science.ru.nl  cpu: 4 x Intel Xeon CPU E7-4870 v2 @ 2.30GHz 15-core, ram: 2929GB
     
      icis partition on slurm22:
        cn114.science.ru.nl  cpu: 2 x AMD EPYC 7642 48-Core Processor , ram: 500GB
        cn115.science.ru.nl  cpu: 2 x AMD EPYC 7642 48-Core Processor , ram: 500GB      
     
    For access you need to be a member of the mbsd group. Ask Harco Kuppens for access (see the sketch below for typical use).
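
    As a sketch of typical use once you are in the mbsd group (partition and login-node names taken from the listing above):

      $ ssh cnlogin22.science.ru.nl          # login node of the slurm22 cluster
      $ scontrol show partition icis         # shows the nodes and allowed groups of the icis partition
      $ srun --partition=icis --pty bash     # interactive shell on one of the icis nodes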
     
 * nodes for education: 
   
     These nodes are all in the slurm22 cluster and were bought for education purposes by Sven-Bodo Scholz
     for use in his course, but they can sometimes be used for research as well.
      
     contact:  Sven-Bodo Scholz
     order date: 20221202
     location: server room C&CZ, machines are managed by C&CZ
     nodes:
        cn124-cn131 : Dell PowerEdge R250 
             cpu: 1 x Intel(R) Xeon(R) E-2378 CPU @ 2.60GHz 8-core Processor with 2 threads per core
             ram: 32 GB
             disk: 1 TB Hard drive          
        cn132 :  Dell PowerEdge R7525
             cpu: 2 x AMD EPYC 7313 16-Core Processor with 1 thread per core
             gpu: NVIDIA Ampere A30, PCIe, 165W, 24GB Passive, Double Wide, Full Height GPU
             ram: 128 GB
              disk: 480 GB SSD
             fpga: Xilinx Alveo U200 225W Full Height FPGA
     cluster partitions:
        $ sinfo | head -1; sinfo -a |grep -e 132 -e 124
        PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
        csmpi_short         up      10:00      8   idle cn[124-131]
        csmpi_long          up   10:00:00      8   idle cn[124-131]
        csmpi_fpga_short    up      10:00      1   idle cn132
        csmpi_fpga_long     up   10:00:00      1   idle cn132 
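
     A minimal sketch of a job that uses these partitions; the script name and node count are placeholders, and csmpi_short has a 10-minute time limit, so keep test jobs short:

        $ cat csmpi-test.sh
        #! /bin/bash
        #SBATCH --partition=csmpi_short
        #SBATCH --nodes=2                 # the partition has 8 nodes (cn124-cn131)
        srun hostname                     # launches one task per allocated node by default
        $ sbatch csmpi-test.sh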
  
 Alternate servers: 
    
   Several subgroups bought servers at Alternate and do the system administration themselves.
   Each server is meant for that subgroup, but you can always ask whether you may get access.
    
   - for group: Robbert Krebbers                themelio.cs.ru.nl
        
       contact: Ike Muller   
       order date: 2021119
       location: server room Mercator 1  
        cpu: AMD Ryzen TR 3960X  - 24 core - 4,5GHz max 
         gpu: ASUS GeForce GT 1030 2GB GDDR5 (GT1030-SL-2G-BRK)
         ram: 128GB Corsair Vengeance PRO SL DDR4-3200 (CL16)
         disk: SSD 1TB Samsung 980 Pro
         motherboard: Gigabyte TRX40 AORUS MASTER
      
   - for group: Sebastian Junges               bert.cs.ru.nl

        order date: 20231120
        contact: ?
        location: server room Mercator 1
        cpu: AMD Ryzen™ Threadripper™ PRO 5965WX Processor - 24 cores, 3,8GHz (4,5GHz turbo boost) 
             48 threads, 128MB L3 Cache, 128 PCIe 4.0 lanes,
        gpu: ASUS DUAL GeForce RTX 4070 OC graphics card, 12 GB (GDDR6X)
             (1x HDMI, 3x DisplayPort, DLSS 3)
        ram: 512GB : 8 x Kingston 64 GB ECC Registered DDR4-3200 server memory
                      (black, KSM32RD4/64HCR, Server Premier, XMP)
        disk: 2 x SAMSUNG 990 PRO, 2 TB SSD (MZ-V9P2T0BW, PCIe Gen 4.0 x4, NVMe 2.0)
        motherboard: Asus Pro WS WRX80E-SAGE SE WIFI
        
        2 x alternate server (may 2024) with specs:     UNKNOWN1.cs.ru.nl UNKNOWN2.cs.ru.nl
        order date: 20240508
        contact: ?
        location: server room mercator 1   
         cpu: AMD Ryzen 9 7900 (socket AM5, boxed)
         motherboard: ASUS TUF GAMING B650-E WIFI
               integrated AMD Radeon Graphics
         ram: G.Skill Flare X5 DDR5-5600 - 96GB
         ssd: Samsung 980 PRO M.2, 2 TB
         power supply: be quiet! Straight Power 12 750W ATX 3.0
         case: Corsair 4000D Airflow TG black (ATX)
   - for group: Nils Jansen                    ernie.cs.ru.nl

         order date: 20231120
         contact: ?
         location: server room Mercator
        cpu: AMD Ryzen™ Threadripper™ PRO 5965WX Processor - 24 cores, 3,8GHz (4,5GHz turbo boost) 
             48 threads, 128MB L3 Cache, 128 PCIe 4.0 lanes,
         gpu: Inno3D GeForce RTX 4090 X3 OC White, 24GB GDDR6X video memory, 21Gbps
         ram:  512GB : 8 x Kingston 64 GB ECC Registered DDR4-3200 server memory
                       (black, KSM32RD4/64HCR, Server Premier, XMP)
         disk: 2 x SAMSUNG 990 PRO, 2 TB SSD (MZ-V9P2T0BW, PCIe Gen 4.0 x4, NVMe 2.0)
         motherboard: Asus Pro WS WRX80E-SAGE SE WIFI
   
        order date: 20201215                       (active)
        contact: Christoph Schmidl/Maris Galesloot
        location: M1.01.16
        cpu: Intel® Core i9-10980XE, 3.0 GHz (4.6 GHz Turbo Boost) socket 2066 processor  (18 cores)
        gpu: GIGABYTE GeForce RTX 3090 VISION OC 24G 
         ram: HyperX 64 GB DDR4-3200 kit
         disk: Samsung 980 PRO 1 TB SSD + WD Blue 6 TB hard disk
         motherboard: ASUS ROG RAMPAGE VI EXTREME ENCORE, socket 2066
         


  • DiS - contact Ronny Wichers Schreur
 cn108.science.ru.nl/tarzan.cs.ru.nl	DS node (Dell PowerEdge R720, 2 x Xeon E5-2670 8C 2.6 GHz, 128 GB)
 britney.cs.ru.nl                      standalone (Katharina Kohls)
 jonsnow.cs.ru.nl                      standalone (Peter Schwabe)
  • DaS - contact Kasper Brink
 cn77.science.ru.nl	IS node (Dell PowerEdge R720, 2 x Xeon E5-2670 8C 2.6 GHz, 128 GB)
 cn78.science.ru.nl	IS node (Dell PowerEdge R720, 2 x Xeon E5-2670-Hyperthreading-on 8C 2.6 GHz, 128 GB)
 cn79.science.ru.nl	IS node (Dell PowerEdge R720, 2 x Xeon E5-2670 8C 2.6 GHz, 256 GB)
 cn104.science.ru.nl	DaS node (Supermicro, 2 x Intel Xeon Silver 4214 2.2 GHz, 128 GB, 8x GPU)
 cn105.science.ru.nl	DaS node (Supermicro, 2 x Intel Xeon Silver 4214 2.2 GHz, 128 GB, 8x GPU)
 
The servers above do not seem to support Slurm.

Servers for all departments

All departments within iCIS have access to the machines bought by iCIS via the icis partition:

   icis partition on slurm20:
     cn89.science.ru.nl  cpu: 4 x Intel Xeon CPU E7-4870 v2 @ 2.30GHz 15-core, ram: 2929GB
    
   icis partition on slurm22:
     cn114.science.ru.nl  cpu: 2 x AMD EPYC 7642 48-Core Processor , ram: 500GB
     cn115.science.ru.nl  cpu: 2 x AMD EPYC 7642 48-Core Processor , ram: 500GB
    

dagobert.cs.ru.nl (= cn89.science.ru.nl within the linux cluster, current load)

  • supports the Slurm software, but you can also run jobs directly on this machine
  • brand: Dell R920
  • os: linux Ubuntu 20.04 LTS
  • cpu: 4 processors of 15 cores each (E7-4870 v2 at 2.3 GHz): 60 cores in total. Note: each core supports hyperthreading, causing Linux to report 120 CPUs.
  • memory:  3.17 TB = 3170 GB = 3170751 MB = 3170751192 kB 
  • local storage: local storage is available on the machine for fast reads and writes, instead of your slower network-mounted home directory:
    /scratch
        RAID-mirrored volume, but there are NO BACKUPS of this directory.
    /scratch-striped
        striped RAID volume with faster access than /scratch, but less redundancy in case of a hard disk crash. There are also NO BACKUPS of this directory.
  • description:

ICIS recently (August 2014) acquired a new server, called Dagobert (because of its hefty price). It is a Dell R920 with quad processors (E7-4870 v2 at 2.3 GHz) of 15 cores each (60 in total) and 3 TB RAM (1600 MHz). A few pictures of our new server are attached. This server was bought from the Radboud Research Facilities grant to explore new research directions, achieve more relevant scientific results and cooperate with local organizations. This is the first of three phases of new equipment for ICIS. For now Dagobert is by far the most powerful server in our whole faculty (with a heavy 4.4 kW power supply and a weight in excess of 30 kg); the next best server has only 256 GB RAM and half the processor power.
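
A quick way to check both local volumes and their free space once logged in on this machine (a standard command; the exact sizes will differ):

 $ df -h /scratch /scratch-striped      # size, used and available space of both local volumes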

cpu details

 $ lscpu
 Architecture:          x86_64
 CPU op-mode(s):        32-bit, 64-bit
 Byte Order:            Little Endian
 CPU(s):                120
 On-line CPU(s) list:   0-119
 Thread(s) per core:    2
 Core(s) per socket:    15
 Socket(s):             4
 NUMA node(s):          4
 Vendor ID:             GenuineIntel
 CPU family:            6
 Model:                 62
 Stepping:              7
 CPU MHz:               2300.154
 BogoMIPS:              4602.40
 Virtualization:        VT-x
 L1d cache:             32K
 L1i cache:             32K
 L2 cache:              256K
 L3 cache:              30720K
 NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116
 NUMA node1 CPU(s):     1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117
 NUMA node2 CPU(s):     2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118
 NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119

memory

 $ cat /proc/meminfo | head -1
 MemTotal:       3170751192 kB