Running update (#2157)
* added internal link for application documentation. minor changes

* code formatting and other minor changes

* minor changes

* code formatting and other minor changes

* Add local storage example, more updates

* add missing directives

* fix tests

---------

Co-authored-by: Rasmus Kronberg <[email protected]>
3 people authored Jul 18, 2024
1 parent 2e5867c commit e032b39
Showing 4 changed files with 338 additions and 174 deletions.
299 changes: 187 additions & 112 deletions docs/computing/running/creating-job-scripts-puhti.md
# Creating a batch job script for Puhti

A batch job script contains the definitions of the resources to be reserved for
a job and the commands the user wants to run.

[TOC]


## A basic batch job script

An example of a simple batch job script:

```bash
#!/bin/bash
#SBATCH --job-name=myTest        # Job name
#SBATCH --account=<project>      # Billing project, has to be defined!
#SBATCH --time=02:00:00          # Max. duration of the job
#SBATCH --mem-per-cpu=2G         # Memory to reserve per core
#SBATCH --partition=small        # Job queue (partition)
##SBATCH --mail-type=BEGIN       # Uncomment to enable mail

module load myprog/1.2.3         # Load required modules

srun myprog -i input -o output   # Run program using requested resources
```

The first line `#!/bin/bash` indicates that the file should be interpreted as a
Bash script.

The lines starting with `#SBATCH` are arguments (directives) for the batch job
system. These examples only use a small subset of the options. For a list of
all possible options, see the
[Slurm documentation](https://slurm.schedmd.com/sbatch.html).

The general syntax of an `#SBATCH` option:

```bash
#SBATCH option_name argument
```

In our example,

```bash
#SBATCH --job-name=myTest
```

sets the name of the job to *myTest*. It can be used to identify a job in the
queue and other listings.

```bash
#SBATCH --account=<project>
```

sets the billing project for the job. Please replace `<project>` with the Unix
group of your project. You can find it in [My CSC](https://my.csc.fi) under the
*Projects* tab. [More information about billing](../../accounts/billing.md).

!!! warning "Remember to specify the billing project"
    The billing project argument is mandatory. Failing to set it will cause an
    error:

    ```text
    sbatch: error: AssocMaxSubmitJobLimit
    sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    ```

The runtime reservation is set with option `--time`:

```bash
#SBATCH --time=02:00:00
```

Time is provided using the format `hh:mm:ss` (optionally `d-hh:mm:ss`, where
`d` is _days_). The maximum runtime depends on the selected queue. **When the
time reservation ends, the job is terminated regardless of whether it has
finished or not**, so the time reservations should be sufficiently long. Note
that a job consumes billing units according to its actual runtime.
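
For example, a reservation of one day and 12 hours using the day format would
be written as:

```bash
#SBATCH --time=1-12:00:00
```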

```bash
#SBATCH --mem-per-cpu=2G
```

sets the required memory per requested CPU core. If the requested memory is
exceeded, the job is terminated.

The partition (queue) needs to be set according to the job requirements. For
example:

```bash
#SBATCH --partition=small
```

!!! Note "Available partitions"
[The available batch job partitions](batch-job-partitions.md).
!!! info "Available partitions"
[See the available batch job partitions](batch-job-partitions.md).

The user can be notified by email when the job *starts* by using the
`--mail-type` option:

```bash
##SBATCH --mail-type=BEGIN # Uncomment to enable mail
```

Other useful arguments (multiple arguments are separated by a comma) are `END`
and `FAIL`. By default, the email will be sent to the email address linked to
your CSC account. This can be overridden with the `--mail-user=` option.
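
For example, to be notified when the job ends or fails, with the mail sent to a
specific address (the address below is a placeholder):

```bash
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
```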

After defining all required resources in the batch job script, set up the
required environment by loading suitable modules. Note that for modules to be
available for batch jobs, they need to be loaded in the batch job script.
[More information about environment modules](../modules.md).

```bash
module load myprog/1.2.3
```

Finally, we launch our application using the requested resources with the
`srun` command:

```bash
srun myprog -i input -o output
```


## Serial and shared memory batch jobs

Serial and shared memory jobs need to be run within one compute node. Thus, the
jobs are limited by the hardware specifications available in the nodes. On
Puhti, each node has two processors with 20 cores each, i.e. 40 cores in total.
[See more technical details about Puhti](../systems-puhti.md).

The `#SBATCH` option `--cpus-per-task` is used to define the number of
computing cores that the batch job task uses. The option `--nodes=1` ensures
that all the reserved cores are located in the same node, and `--ntasks=1`
assigns all reserved computing cores for the same task.
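
For example, the following directives reserve four cores on a single node for
one task (the core count here is purely illustrative):

```bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
```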

In thread-based jobs, the `--mem` option is recommended for memory reservation.
This option defines the amount of memory required *per node*. Note that if you
use `--mem-per-cpu` option instead, the total memory request of the job will be
the memory requested per CPU core (`--mem-per-cpu`) multiplied by the number of
reserved cores (`--cpus-per-task`). **Thus, if you modify the number of cores,
also check that the memory reservation is appropriate.**

Typically, the most efficient practice is to match the number of reserved cores
(`--cpus-per-task`) to the number of threads or processes the application uses.
However, always [check the application-specific details](../../apps/index.md).

If the application has a command-line option to set the number of
threads/processes/cores, it should always be used to ensure that the software
behaves as expected. Some applications use only one core by default, even if
more are reserved.

Other applications may try to use all cores in the node, even if only some
are reserved. The environment variable `$SLURM_CPUS_PER_TASK`, which stores the
value of `--cpus-per-task`, can be used instead of a number when specifying the
number of cores to use. This is useful as the command does not need to be
modified if `--cpus-per-task` is changed later.
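
For example, if the application accepted a (hypothetical) `--threads` option:

```bash
srun myprog --threads=$SLURM_CPUS_PER_TASK
```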

Finally, use the environment variable `OMP_NUM_THREADS` to set the number of
threads the application uses. For example,

```bash
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
```
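
Putting these options together, a minimal sketch of a thread-based batch job
script could look like the following (the module, program and resource values
are placeholders):

```bash
#!/bin/bash
#SBATCH --account=<project>     # Billing project
#SBATCH --partition=small       # Job queue (partition)
#SBATCH --time=01:00:00         # Max. duration of the job
#SBATCH --nodes=1               # Run on a single node
#SBATCH --ntasks=1              # One task
#SBATCH --cpus-per-task=4       # Cores reserved for the task
#SBATCH --mem=8G                # Total memory per node

module load myprog/1.2.3        # Placeholder module

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun myprog -i input -o output
```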

## MPI-based batch jobs

In MPI jobs, each task has its own memory allocation. Thus, the tasks can be
distributed over multiple nodes.

Set the number of MPI tasks with:

```bash
#SBATCH --ntasks=<number_of_mpi_tasks>
```

If more fine-tuned control is required, the exact number of nodes and number of
tasks per node can be specified with `--nodes` and `--ntasks-per-node`,
respectively. This is typically recommended in order to avoid tasks spreading
over unnecessarily many nodes,
[see the Performance checklist](./performance-checklist.md#limit-unnecessary-spreading-of-parallel-tasks-in-puhti).

It is recommended to request memory using the `--mem-per-cpu` option.


!!! info "Running MPI programs"
    - MPI programs **should not** be started with `mpirun` or `mpiexec`. Use
      `srun` instead.
    - An MPI module has to be loaded in the batch job script for the program
      to work properly.
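
As a minimal sketch, an MPI batch job script could look roughly like this (the
partition, module, program and resource values are placeholders):

```bash
#!/bin/bash
#SBATCH --account=<project>     # Billing project
#SBATCH --partition=large       # Job queue (partition)
#SBATCH --time=02:00:00         # Max. duration of the job
#SBATCH --nodes=2               # Number of nodes
#SBATCH --ntasks-per-node=40    # MPI tasks per node
#SBATCH --mem-per-cpu=2G        # Memory per core

module load myprog/1.2.3        # Placeholder; load an MPI-enabled module

srun myprog -i input -o output
```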

## Hybrid batch jobs

In hybrid jobs, each task is allocated several cores. Each task then uses some
parallelization, other than MPI, to do the work. The most common strategy is
for every MPI task to launch multiple threads using OpenMP. To request more
cores per MPI task, use the argument `--cpus-per-task`. The default value is
one core per task.

The optimal ratio between the number of tasks and the number of cores per task
varies for each application, so testing is required to find the right
combination for your application.

!!! info "Threads per task in hybrid MPI/OpenMP jobs"
    Set the number of OpenMP threads per MPI task in your batch script using
    the `OMP_NUM_THREADS` and `SLURM_CPUS_PER_TASK` environment variables:

    ```bash
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ```
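
As a minimal sketch, a hybrid MPI/OpenMP batch job script could look roughly
like this (the partition, module, program and resource values are placeholders;
here 2 nodes × 4 tasks × 10 cores fills two 40-core nodes):

```bash
#!/bin/bash
#SBATCH --account=<project>     # Billing project
#SBATCH --partition=large       # Job queue (partition)
#SBATCH --time=02:00:00         # Max. duration of the job
#SBATCH --nodes=2               # Number of nodes
#SBATCH --ntasks-per-node=4     # MPI tasks per node
#SBATCH --cpus-per-task=10      # Cores (OpenMP threads) per MPI task
#SBATCH --mem-per-cpu=2G        # Memory per core

module load myprog/1.2.3        # Placeholder module

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun myprog -i input -o output
```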

## Additional resources in batch jobs

### Local storage

Some nodes on Puhti have fast local storage space (NVMe) available for jobs.
Using local storage is recommended for I/O-intensive applications, i.e. jobs
that, for example, read and write a lot of small files.
[See more details](../disk.md#temporary-local-disk-areas).

Local storage is available on:

* GPU nodes in the `gpu` and `gputest` partitions (max. 3600 GB per node)
* I/O nodes shared by the `small`, `large`, `longrun` and `interactive`
partitions (max. 1490/3600 GB per node)
* BigMem nodes in the `hugemem` and `hugemem_longrun` partitions (max. 5960 GB
per node)

Request local storage using the `--gres` flag in the batch script:

```bash
#SBATCH --gres=nvme:<local_storage_space_per_node_in_GB>
```

The amount of space is given in GB (check maximum sizes from the list above).
For example, to request 100 GB of storage, use option `--gres=nvme:100`. The
local storage reservation is on a per-node basis.

Use the environment variable `$LOCAL_SCRATCH` in your batch job scripts to
access the local storage space on each node. For example, to extract a large
dataset package to the local storage:

```bash
tar xf my-large-dataset.tar.gz -C $LOCAL_SCRATCH
```

!!! warning "Remember to recover your data"
    The local storage space reserved for your job is emptied after the job has
    finished. Thus, if you write data to the local disk during your job, please
    remember to move anything you want to preserve to the shared disk area at
    the end of your job. In particular, the commands to move the data must be
    given in the batch job script, as you cannot access the local storage space
    anymore after the batch job has completed. For example, to move some output
    data back to the directory from which the batch job was submitted:

    ```bash
    mv $LOCAL_SCRATCH/my-important-output.log $SLURM_SUBMIT_DIR
    ```

### GPUs

Puhti has 320 Nvidia Tesla V100 GPUs. The GPUs are available in the `gpu` and
`gputest` partitions and can be requested with:

```bash
#SBATCH --gres=gpu:v100:<number_of_gpus_per_node>
```

The `--gres` reservation is on a per-node basis. There are 4 GPUs per GPU node.

Multiple resources can be requested with a comma-separated list. To request
both GPU and local storage:

```bash
#SBATCH --gres=gpu:v100:<number_of_gpus_per_node>,nvme:<local_storage_space_per_node>
```

For example, to request 1 GPU and 10 GB of NVMe storage, the option would be
`--gres=gpu:v100:1,nvme:10`.
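
As a minimal sketch, a single-GPU batch job script using local storage could
look roughly like this (the module, program and resource values are
placeholders):

```bash
#!/bin/bash
#SBATCH --account=<project>          # Billing project
#SBATCH --partition=gpu              # GPU job queue (partition)
#SBATCH --time=01:00:00              # Max. duration of the job
#SBATCH --ntasks=1                   # One task
#SBATCH --cpus-per-task=10           # Cores reserved for the task
#SBATCH --mem-per-cpu=8G             # Memory per core
#SBATCH --gres=gpu:v100:1,nvme:100   # One GPU and 100 GB of local storage

module load myprog/1.2.3             # Placeholder module

srun myprog -i input -o $LOCAL_SCRATCH/output
mv $LOCAL_SCRATCH/output $SLURM_SUBMIT_DIR   # Recover output from local storage
```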

## More information

* [Puhti example batch scripts](example-job-scripts-puhti.md)
* [Available batch job partitions](batch-job-partitions.md)
* [Batch job training materials](https://csc-training.github.io/csc-env-eff/part-1/batch-jobs/)
* [Slurm documentation](https://slurm.schedmd.com/documentation.html)
