Merge pull request #422 from pneerincx/feature/slurm_qos_interactive
Feature: Slurm QoS interactive
scimerman authored Jun 29, 2021
2 parents ee62572 + d2c95bd commit dfe8adf
Showing 7 changed files with 150 additions and 57 deletions.
63 changes: 46 additions & 17 deletions roles/online_docs/templates/mkdocs/docs/analysis.md
@@ -229,17 +229,17 @@ scancel -u [your account]

When you need to interact with a running job you can start an interactive session with the [srun](http://slurm.schedmd.com/srun.html) command.
This creates a shell on a compute node, which works the same as a shell on the User Interface except that the shell is restricted to the requested resources.
This is ideal for debugging/testing and prevents your work from running out of control and crashing processes from other users or vice versa.
Just like for the ```sbatch``` command you will need to request resources like amount of cores, amount of memory, work allocation time (walltime), etc.
This is ideal for debugging/testing and prevents your work from running out of control, crashing processes from other users or vice versa.
Just like for the ```sbatch``` command for batch jobs, you will need to request resources like the number of cores, the amount of memory, work allocation time (walltime), etc. for interactive jobs too.
E.g. to request a session for one hour:
```
srun --cpus-per-task=1 --mem=1gb --nodes=1 --qos=priority --time=01:00:00 --pty bash -i
srun --cpus-per-task=1 --mem=1gb --nodes=1 --qos=interactive --time=01:00:00 --pty bash -i
```
When the requested resources are available the interactive session will start immediately.
To increase the chance your interactive session will start quickly, even when the cluster is relatively busy, you can request _Quality of Service_ level _priority_ with ```--qos=priority```.
To increase the chance your interactive session will start quickly, even when the cluster is relatively busy, you can request _Quality of Service_ level _interactive_ with ```--qos=interactive```.

**Essential**: the order of ```srun``` arguments is not important except that ```--pty bash -i``` must be last.
Any options after that are interpreted as arguments for the requested shell and not for the ```srun``` command.
Any options after ```--pty bash``` are interpreted as arguments for the requested shell and not for the ```srun``` command.
Hence the ```-i``` in the example is an argument for the ```bash``` shell.
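For example, only the first command below does what you probably intend; in the second one (shown commented out as a sketch) the ```--time``` option would be handed to ```bash``` instead of to ```srun```:
```
#
# Correct: all srun options first, --pty bash -i last.
#
srun --cpus-per-task=1 --mem=1gb --nodes=1 --qos=interactive --time=01:00:00 --pty bash -i
#
# Wrong: anything after --pty bash -i is passed to the bash shell, so srun never sees --time.
#
#srun --cpus-per-task=1 --mem=1gb --nodes=1 --qos=interactive --pty bash -i --time=01:00:00
```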

When you exit the bash shell using either the ```exit``` command or by pressing ```CTRL+d``` the interactive job will be cancelled automagically and the corresponding resources released.
Expand All @@ -253,6 +253,7 @@ Which job will be started next is determined based on

1. Job priority
2. Backfill to improve scheduling efficiency
3. Whether a job in the queue can preempt a running job or not.

#### Job priority

Expand All @@ -275,15 +276,25 @@ Slurm will start them before the higher priority _5 core for 1 hour_ job, which
Please note that this is a simplified example taking only cores and time into account.
In real life Slurm is playing sort of a multidimensional [Tetris](https://nl.wikipedia.org/wiki/Tetris) game taking other resources like memory into account too.
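If you are curious how Slurm arrived at the priority of your waiting jobs, the standard ```sprio``` command breaks the priority down into its individual factors. A minimal sketch, assuming ```sprio``` is available to regular users on this cluster:
```
#
# Long listing of priority factors (fair share, age, QoS, etc.) for all pending jobs.
#
sprio -l
#
# The same, but restricted to your own pending jobs.
#
sprio -l -u "${USER}"
```
Note that ```sprio``` only reports jobs that are still pending in the queue; once a job has started its priority is no longer relevant.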

#### Job preemption

Preemption means that a job in the queue can get resources by pushing another running job out of its way.
For the running job that gets preempted this means it will get killed and automatically rescheduled.
Unless the rescheduled job can use a smart form of checkpointing to resume from where it got interrupted,
it will have to start all over from scratch and any resources it used up to the point it got killed & rescheduled were wasted.
Therefore preemption of short jobs can help to free up resources for high priority jobs on a busy cluster without wasting a lot,
but it is less suitable for long running jobs: the longer the walltime, the higher the chance a job gets preempted and hence the more resources get wasted.
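Which QoS level may preempt which other QoS (sub-)levels is configured in the Slurm accounting database. A read-only sketch to inspect this yourself, assuming ```sacctmgr``` is available to regular users:
```
#
# List all QoS levels with their priority, usage factor, preemption targets and max walltime.
#
sacctmgr show qos format=Name%20,Priority,UsageFactor,Preempt%45,MaxWall
```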

## Quality of Service

We use 5 Quality of Service (QoS) levels with 3 QoS sub-levels each.
We use 6 Quality of Service (QoS) levels with 3 QoS sub-levels each.
The QoS sub-levels are automagically determined by the system to differentiate between short versus medium versus long running jobs
and enforce limits on resources available to the latter to prevent long running jobs from hogging the complete cluster.
The base QoS levels are determined by the users and these allow you to differentiate between:

* jobs with lower versus higher priority
* high performance computing versus data staging jobs
* batch jobs versus interactive jobs

![QoS](img/slurm_qos.svg)

Expand All @@ -293,13 +304,15 @@ By specifying a QoS level with higher priority you can request Slurm to re-order

#### QoS levels

| QoS | Priority | Usage factor | Available resources | Shared Storage |
|:---------- |:----------- |:---------------------- |:--------------------------------------------- |:---------------|
| leftover | 0 | none | Lots, up to the whole cluster for short jobs. | tmp only |
| regular | default | default | Quite some, but never the whole cluster. | tmp only |
| priority | default x 2 | default x 2 | Just a few, max ~ 25 percent of the cluster. | tmp only |
| panic mode | default x 2 | default x 2 | Occasionally: Just a few. | tmp only |
| ds | default | default | Minimal: max 1 core + 1GB mem per job. | tmp and prm |
| QoS | Priority | Usage Factor | Available Resources | Shared Storage | Preemptable Jobs |
|:----------- |:----------- |:------------- |:--------------------------------------------- |:-------------- |:---------------- |
| leftover | 0 | none | Lots, up to the whole cluster for short jobs. | tmp only | Yes |
| regular | default | default | Quite some, but never the whole cluster. | tmp only | Only short jobs |
| priority | default x 2 | default x 2 | Just a few, max ~ 25 percent of the cluster. | tmp only | No |
| panic mode | default x 2 | default x 2 | Occasionally: Just a few. | tmp only | No |
| interactive | default x 3 | default | Minimal: max 1 job per user. | tmp only | No |
| ds | default | default | Minimal: max 1 core + 1GB mem per job. | tmp and prm | No |


Recent jobs determine your _fair share_ weight when calculating job priority:
The more resources you recently consumed the lower your priority for new jobs.
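You can check your current _fair share_ factor with the standard ```sshare``` command; a minimal sketch:
```
#
# Show the fair share info for your own associations;
# the FairShare column is the normalised weight used when calculating job priority.
#
sshare -u "${USER}"
```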
@@ -327,20 +340,24 @@ You are a cheapskate and decided to go Dutch.
You'll consume whatever resources are _leftover_ and will accept lowest priority for your jobs.
The _usage factor_ is zero, so any resources consumed using this QoS level will not impact your _fair share_,
which is used for calculating job priority.
Jobs from all other QoS levels can preempt jobs in QoS level _leftover_.
It may take some time for this research project to complete, but hey you got it for free!
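Requesting this QoS level is simply a matter of adding the ```--qos``` option; the script name below is made up:
```
#
# Submit a batch job in QoS leftover ...
#
sbatch --qos=leftover my_job_script.sh
#
# ... or put the equivalent '#SBATCH --qos=leftover' line in the header of the job script itself.
#
```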

#### 2. QoS regular

No goofy exceptions; this is the default when no QoS level is requested explicitly.
Running with this QoS level will process jobs with standard priority and count for your _fair share_ accordingly.
Medium and long running jobs cannot get preempted: once started, they will be allowed to finish
no matter how busy the cluster is. Short jobs may get preempted, but only by jobs in QoS _interactive_.

#### 3. QoS priority

You are working on multiple projects simultaneously and have a lot of jobs in the queue already,
but are eager to get the results for jobs submitted with this QoS level first.
The total amount of resources available to this QoS level is limited and
your _fair share_ factor is charged double the amount of (normalised) resources as compared to when using QoS ```regular```,
so choose wisely what you submit with QoS level ```priority```.
Jobs cannot get preempted by others: once started, they will be allowed to finish.

#### 4. QoS panic mode

Expand All @@ -365,7 +382,19 @@ the following rules apply:
Using these additional resources we can then either increase the capacity to process jobs faster using QoS level ```regular```
or create a dedicated QoS level with increased _fair share_ ratio depending on the investment (minimal investment 10K euro).

#### 5. QoS ds
#### 5. QoS interactive

A dedicated QoS level for interactive jobs. These jobs will get super mega hyper priority as staring at a terminal waiting for a session to start isn't fun.
You can have only one job in QoS _interactive_; otherwise it would not be interactive anymore.
There is neither a _medium_ nor a _long_ QoS sub-level for interactive jobs:
if you need more than 6 hours, it is either no longer interactive work or it is not healthy and you need to get yourself a break!
Jobs in QoS _interactive-short_ cannot get preempted themselves and can preempt jobs in QoS _regular-short_ & _leftover_.
Interactive jobs will have the bash ```${TMOUT}``` environment variable set to 30 minutes, so you can get a quick coffee break,
but do not try to keep inactive _interactive_ sessions alive by running silly programs that waste CPU cycles:
log out if you go to a meeting and start a new interactive job when you get back instead.
Wasting more than 30 minutes' worth of resources in this QoS may lead to a temporary ban.
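Putting it all together, a typical interactive session could look like this sketch:
```
#
# Request an interactive session of max 2 hours in QoS interactive.
#
srun --cpus-per-task=1 --mem=1gb --nodes=1 --qos=interactive --time=02:00:00 --pty bash -i
#
# Once the session has started, verify the automagic logout timer:
# this should report 1800 (seconds), i.e. 30 minutes.
#
echo "${TMOUT}"
#
# Done early? Release the resources immediately instead of waiting for the timeout.
#
exit
```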

#### 6. QoS ds

QoS dedicated for **d**ata **s**taging and the only one where jobs can access both _tmp_ as well as _prm_ shared storage systems.
To prevent abuse, jobs can only use a single core and 1 GB memory max,
@@ -498,7 +527,7 @@ The compute nodes of this cluster do not have local scratch disks.
If your workload uses a random IO pattern that produces too much load on a shared file system,
you should consider using a different algorithm or different cluster.
{% endif %}
## Debugging and Frequent Asked Question (FAQs)
## Debugging and Frequently Asked Questions (FAQs)

#### Q: How do I know what environment is available to my job on an execution host?

10 changes: 6 additions & 4 deletions roles/slurm_management/files/slurm.epilog
@@ -1,15 +1,17 @@
#!/bin/bash

if [ -z "${SLURM_JOB_ID}" ]; then
if [[ -z "${SLURM_JOB_ID}" ]]; then
logger -s "WARN: SLURM_JOB_ID is empty or unset in SLURM epilog."
exit 0
fi

#
# Cleanup job's private tmp dir.
#
TMPDIR="/local/${SLURM_JOB_ID}/"
rm -rf "${TMPDIR}"
TMPDIR="/local/${SLURM_JOB_ID:-missing_SLURM_JOB_ID}/"
if [[ -e "${TMPDIR}" ]]; then
rm -Rf "${TMPDIR}"
fi

#
# Append resource usage stats to job's *.out file if we have an STDOUT file.
Expand All @@ -19,7 +21,7 @@ SCONTROL_JOB_INFO="$(scontrol show job ${SLURM_JOB_ID})"
SLURM_JOB_STDOUT="$(printf '%s\n' "${SCONTROL_JOB_INFO}" | grep 'StdOut=' | sed 's/[[:space:]]*StdOut=//')"
SLURM_JOB_NODE="$(printf '%s\n' "${SCONTROL_JOB_INFO}" | grep 'BatchHost=' | sed 's/[[:space:]]*BatchHost=//')"
SLURM_JOB_STDOUT="$(echo "${SLURM_JOB_STDOUT}" | sed "s/%N/${SLURM_JOB_NODE}/")"
if [ -w "${SLURM_JOB_STDOUT}" ]; then
if [[ -w "${SLURM_JOB_STDOUT}" ]]; then
sformat='JobId,Elapsed,AllocCPUs,AveCPU,ReqMem,MaxVMSize,MaxRSS,MaxDiskRead,MaxDiskWrite'
echo '#################################################################################################################' >> "${SLURM_JOB_STDOUT}"
echo '# Job details recorded by SLURM job epilog using sacct. #' >> "${SLURM_JOB_STDOUT}"
52 changes: 32 additions & 20 deletions roles/slurm_management/files/slurm.prolog
@@ -1,34 +1,46 @@
#!/bin/bash

#
# Make sure we are successful in making tmp dirs in /local.
# When this failed the job should not continue as SLURM will default to /tmp,
# which is not suitable for heavy random IO nor large data sets.
# Hammering /tmp may effectively result in the node going down.
# When the prolog fails the node will be set to state=DRAIN instead.
#

if [ -z "${SLURM_JOB_ID}" ]; then
if [[ -z "${SLURM_JOB_ID}" ]]; then
logger -s "FATAL: SLURM_JOB_ID is empty or unset in SLURM prolog."
exit 1
elif [[ -z "${SLURM_JOB_QOS}" ]]; then
logger -s "FATAL: SLURM_JOB_QOS is empty or unset in SLURM prolog."
exit 1
#else
# logger -s "DEBUG: Found SLURM_JOB_ID ${SLURM_JOB_ID} in SLURM prolog."
# logger -s "DEBUG: Found SLURM_JOB_ID ${SLURM_JOB_ID} and SLURM_JOB_QOS ${SLURM_JOB_QOS} in SLURM prolog."
fi

set -e
set -u

LOCAL_SCRATCH_DIR='/local'
#
# Check if local scratch dir is mountpoint and hence not a dir on the system disk.
#
LOCAL_SCRATCH_DIR='/local'
if [ $(stat -c '%d' "${LOCAL_SCRATCH_DIR}") -eq $(stat -c '%d' "${LOCAL_SCRATCH_DIR}/..") ]; then
logger -s "FATAL: local scratch disk (${LOCAL_SCRATCH_DIR}) is not mounted."
exit 1
#else
# logger -s "DEBUG: local scratch disk (${LOCAL_SCRATCH_DIR}) is mounted."
if [[ $(stat -c '%d' "${LOCAL_SCRATCH_DIR}") -eq $(stat -c '%d' "${LOCAL_SCRATCH_DIR}/..") ]]; then
if [[ "${SLURM_JOB_QOS}" =~ ^ds.* ]]; then
#
# For the data staging QoS "ds", which executes jobs only on the UI,
# a dedicated tmp dir per job may be absent as not all UIs have a /local mount.
#
logger -s "WARN: local scratch disk (${LOCAL_SCRATCH_DIR}) is not mounted."
else
#
# Make sure we can create tmp dirs in /local on compute nodes.
# When this fails the job must not continue as SLURM will default to /tmp,
# which is not suitable for heavy random IO nor large data sets.
# Hammering /tmp may effectively result in the node going down.
# When the prolog fails the node will be set to state=DRAIN instead.
#
logger -s "FATAL: local scratch disk (${LOCAL_SCRATCH_DIR}) is not mounted."
exit 1
fi
else
#
# Create dedicated tmp dir for this job.
#
TMPDIR="${LOCAL_SCRATCH_DIR}/${SLURM_JOB_ID}/"
#logger -s "DEBUG: local scratch disk (${LOCAL_SCRATCH_DIR}) is mounted. Trying to create ${TMPDIR} ..."
mkdir -m 700 -p "${TMPDIR}" || logger -s "FATAL: failed to create ${TMPDIR}."
chown "${SLURM_JOB_USER}" "${TMPDIR}" || logger -s "FATAL: failed to chown ${TMPDIR}."
fi

TMPDIR="${LOCAL_SCRATCH_DIR}/${SLURM_JOB_ID}/"
mkdir -m 700 -p "${TMPDIR}" || logger -s "FATAL: failed to create ${TMPDIR}."
chown "${SLURM_JOB_USER}" "${TMPDIR}" || logger -s "FATAL: failed to chown ${TMPDIR}."
40 changes: 27 additions & 13 deletions roles/slurm_management/files/slurm.taskprolog
@@ -1,30 +1,44 @@
#!/bin/bash

#
# Make sure we have a tmp dir in /local.
# When this failed the job should not continue as SLURM will default to /tmp,
# which is not suitable for heavy random IO nor large data sets.
# Hammering /tmp may effectively result in the node going down.
# When the prolog fails the node will be set to state=DRAIN instead.
#

if [ -z "${SLURM_JOB_ID}" ]; then
logger -s "FATAL: SLURM_JOB_ID is empty or unset in SLURM task prolog."
exit 1
elif [[ -z "${SLURM_JOB_QOS}" ]]; then
logger -s "FATAL: SLURM_JOB_QOS is empty or unset in SLURM task prolog."
exit 1
fi

set -e
set -u

#
# Make sure we have a tmp dir in /local on compute nodes.
# When this failed the job should not continue as SLURM will default to /tmp,
# which is not suitable for heavy random IO nor large data sets.
# Hammering /tmp may effectively result in the node going down.
# When the prolog fails the node will be set to state=DRAIN instead.
#
# For the data staging QoS "ds", which executes jobs only on the UI,
# a dedicated tmp dir per job may be absent as not all UIs have a /local mount.
#
TMPDIR="/local/${SLURM_JOB_ID}/"

if [ ! -d "${TMPDIR}" ]; then
if [[ ! -d "${TMPDIR}" ]] && [[ ! "${SLURM_JOB_QOS}" =~ ^ds.* ]]; then
logger -s "FATAL: TMPDIR ${TMPDIR} is not available in SLURM task prolog."
exit 1
else
#
# STDOUT from this task prolog is used to initialize the job task's env,
# so we need to print the export statements to STDOUT.
#
echo "export TMPDIR=${TMPDIR}"
fi

#
# STDOUT from this task prolog is used to initialize the job task's env,
# so we need to print the export statements to STDOUT.
# Set TMOUT to configure automagic logout from interactive sessions
# after 30 minutes of inactivity.
#
echo "export TMPDIR=${TMPDIR}"
if [[ "${SLURM_JOB_QOS}" =~ ^interactive.* ]]; then
echo "TMOUT=1800"
echo "readonly TMOUT"
echo "export TMOUT"
fi
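For reference, a batch job would consume the per-job ```${TMPDIR}``` exported by this task prolog roughly like the sketch below; it is not part of this commit and the paths, group name and tool are made up:
```
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb
#SBATCH --time=01:00:00
#
# TMPDIR points to /local/${SLURM_JOB_ID}/ on the compute node (see slurm.prolog & slurm.taskprolog).
# Stage input to fast local scratch, work there and copy the results back;
# slurm.epilog removes /local/${SLURM_JOB_ID}/ when the job finishes.
#
cp "/groups/mygroup/tmp01/input.dat" "${TMPDIR}"
cd "${TMPDIR}"
my_analysis_tool input.dat > output.dat
cp "output.dat" "/groups/mygroup/tmp01/"
```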
@@ -77,6 +77,7 @@ sacctmgr -i modify qos Name='regular-medium' set \
Priority=10 \
Preempt='leftover-short,leftover-medium,leftover-long' \
GrpSubmit=30000 MaxSubmitJobsPU=5000 MaxWall=1-00:00:00 \
GrpTRES=cpu={{ [1, (cluster_cores_total | float * 0.6) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.6) | int] | max }} \
MaxTRESPU=cpu={{ [1, (cluster_cores_total | float * 0.4) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.4) | int] | max }}

sacctmgr -i create qos set Name='regular-long'
Expand All @@ -103,6 +104,7 @@ sacctmgr -i create qos set Name='priority-short'
sacctmgr -i modify qos Name='priority-short' set \
Description='priority-short' \
Priority=20 \
Preempt='leftover-short,leftover-medium,leftover-long' \
UsageFactor=2 \
GrpSubmit=5000 MaxSubmitJobsPU=1000 MaxWall=06:00:00 \
MaxTRESPU=cpu={{ [1, (cluster_cores_total | float * 0.25) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.25) | int] | max }}
Expand All @@ -111,20 +113,42 @@ sacctmgr -i create qos set Name='priority-medium'
sacctmgr -i modify qos Name='priority-medium' set \
Description='priority-medium' \
Priority=20 \
Preempt='leftover-short,leftover-medium,leftover-long' \
UsageFactor=2 \
GrpSubmit=2500 MaxSubmitJobsPU=500 MaxWall=1-00:00:00 \
GrpTRES=cpu={{ [1, (cluster_cores_total | float * 0.5) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.5) | int] | max }} \
GrpTRES=cpu={{ [1, (cluster_cores_total | float * 0.6) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.6) | int] | max }} \
MaxTRESPU=cpu={{ [1, (cluster_cores_total | float * 0.2) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.2) | int] | max }}

sacctmgr -i create qos set Name='priority-long'
sacctmgr -i modify qos Name='priority-long' set \
Description='priority-long' \
Priority=20 \
Preempt='leftover-short,leftover-medium,leftover-long' \
UsageFactor=2 \
GrpSubmit=250 MaxSubmitJobsPU=50 MaxWall=7-00:00:00 \
GrpTRES=cpu={{ [1, (cluster_cores_total | float * 0.2) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.2) | int] | max }} \
GrpTRES=cpu={{ [1, (cluster_cores_total | float * 0.3) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.3) | int] | max }} \
MaxTRESPU=cpu={{ [1, (cluster_cores_total | float * 0.1) | int] | max }},mem={{ [1000, (cluster_mem_total | float * 0.1) | int] | max }}

#
# QoS interactive
#
sacctmgr -i create qos set Name='interactive'
sacctmgr -i modify qos Name='interactive' set \
Description='Highest priority Quality of Service level for interactive sessions.' \
Priority=30 \
UsageFactor=1 \
MaxSubmitJobsPU=1 \
GrpTRES=cpu=0,mem=0

sacctmgr -i create qos set Name='interactive-short'
sacctmgr -i modify qos Name='interactive-short' set \
Description='interactive-short' \
Priority=30 \
Preempt='leftover-short,leftover-medium,leftover-long,regular-short' \
UsageFactor=1 \
MaxSubmitJobsPU=1 MaxWall=06:00:00 \
MaxTRESPU=cpu={{ [1, (vcompute_max_cpus_per_node | float * 0.5) | int] | max }},mem={{ [1000, (vcompute_max_mem_per_node | float * 0.5) | int] | max }}

#
# QoS ds
#
@@ -189,6 +213,9 @@ sacctmgr -i modify account root set \
sacctmgr -i modify account root set \
QOS+=ds,ds-short,ds-medium,ds-long

sacctmgr -i modify account root set \
QOS+=interactive,interactive-short

sacctmgr -i modify account root set \
DefaultQOS=priority

Expand All @@ -207,6 +234,9 @@ sacctmgr -i modify account users set \
sacctmgr -i modify account users set \
QOS+=ds,ds-short,ds-medium,ds-long

sacctmgr -i modify account users set \
QOS+=interactive,interactive-short

sacctmgr -i modify account users set \
DefaultQOS=regular
