Commit

Merge pull request #304 from rug-cit-hpc/develop
Merge develop branch into master in preparation for a new release.
marieke-bijlsma authored Aug 11, 2020
2 parents a0382dc + 51c3a95 commit 5b6fd04
Showing 302 changed files with 12,830 additions and 1,941 deletions.
7 changes: 6 additions & 1 deletion .ansible-lint
@@ -1,3 +1,8 @@
---
exclude_paths:
- "~/.ansible" # Exclude external playbooks.
- '~/.ansible' # Exclude external playbooks.
skip_list:
# We explicitly use latest combined with other tech to pin versions (e.g. Spacewalk).
- '403' # "Package installs should not use latest."
- '701' # "No 'galaxy_info' found in meta/main.yml of a role."
...
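For reference, ansible-lint picks up an `.ansible-lint` configuration file from the working directory by default, so the exclude and skip settings above also apply to local runs. A minimal sketch of such a local check, assuming the repository root as working directory and a virtualenv as used in the CI config below:
```bash
# Install ansible-lint into a local virtualenv (same approach as the CI job).
python3 -m venv venv
. venv/bin/activate
pip install ansible-lint

# Run from the repo root so the .ansible-lint config above is picked up;
# rules 403 and 701 are suppressed via its skip_list.
ansible-lint cluster.yml
```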
13 changes: 6 additions & 7 deletions .circleci/config.yml
@@ -10,25 +10,24 @@ jobs:
build:
docker:
- image: circleci/python:3

working_directory: ~/repo

steps:
- checkout

- run:
name: install dependencies
command: |
python3 -m venv venv
. venv/bin/activate
pip install ansible-lint
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
ansible-lint cluster.yml > lint_results 2>&1 || true
errors=$(grep -c "\:" lint_results)
if (( errors > 224 )); then /bin/false; fi
cat lint_results
errors=$(grep -c '^[0-9]* [A-Z].*' lint_results)
echo '###############################################'
printf 'Counted %d ansible-lint errors.\n' ${errors:-0}
echo '###############################################'
if (( errors > 1 )); then /bin/false; fi
...
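The reworked test step no longer counts every line containing a colon; it prints the full report, counts only lines that look like numbered ansible-lint violations, and fails the build when more than one is found. A hedged sketch for reproducing the same check locally (paths assumed relative to the repository root, virtualenv as above):
```bash
# Run the linter without letting a non-zero exit code abort the script.
. venv/bin/activate
ansible-lint cluster.yml > lint_results 2>&1 || true
cat lint_results

# Count lines of the form "<number> <Message ...>", i.e. reported violations.
errors=$(grep -c '^[0-9]* [A-Z].*' lint_results)
printf 'Counted %d ansible-lint errors.\n' "${errors:-0}"

# Same threshold as the CI job: more than one counted error fails the check.
if (( errors > 1 )); then exit 1; fi
```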
2 changes: 1 addition & 1 deletion .gitignore
@@ -7,7 +7,7 @@
*.pyc
*.retry
*.swp
.vault_pass.txt*
.vault*
documentation/.~lock.UMCG Research IT HPC cluster technical design.docx#
promtools/results/*
roles/hpc-cloud
46 changes: 27 additions & 19 deletions README.md
@@ -102,42 +102,44 @@ Deploying a fully functional virtual cluster from scratch involves the following
ansible-galaxy install -r galaxy-requirements.yml
```

2. Create `.vault_pass.txt`.
2. Create a `vault_pass.txt`.

The vault passwd is used to encrypt/decrypt the ```secrets.yml``` file per cluster,
which will be created in the next step if you do not already have one.
If you have multiple HPC clusters with their own vault passwd you can have multiple vault password files.
The pattern ```.vault_pass.txt*``` is part of ```.gitignore```, so if you use ```.vault_pass.txt.[name-of-the-cluster]```
for your vault password files they will not accidentally get committed to the repo.

* To generate a new Ansible vault password and put it in ```.vault_pass.txt.[name-of-the-cluster]```, use the following oneliner:
In addition, a second vault passwd is used for various files in ```group_vars/all/```, which contain settings that are the same for all clusters.
If you have multiple HPC clusters with their own vault passwd you will have multiple vault password files.
The pattern ```.vault*``` is part of ```.gitignore```, so if you put the vault passwd files in the ```.vault/``` subdir,
they will not accidentally get committed to the repo.

* To generate a new Ansible vault password and put it in ```.vault/vault_pass.txt.[name-of-the-cluster|all]```, use the following oneliner:
```bash
tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1 > .vault_pass.txt.[name-of-the-cluster]
tr -cd '[:alnum:]' < /dev/urandom | fold -w60 | head -n1 > .vault/vault_pass.txt.[name-of-the-cluster|all]
```
* Or to use an existing Ansible vault password create ```.vault_pass.txt.[name-of-the-cluster]``` and use a text editor to add the password.
* Make sure the ```.vault_pass.txt.[name-of-the-cluster]``` is private:
* Or, to use an existing Ansible vault password, create ```.vault/vault_pass.txt.[name-of-the-cluster|all]``` and use a text editor to add the password.
* Make sure the ```.vault/``` subdir and its contents are private:
```bash
chmod go-rwx .vault_pass.txt.[name-of-the-cluster]
chmod -R go-rwx .vault/
```
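Putting the commands from this step together for a concrete example, here is a sketch creating both the shared and a cluster-specific vault password file; the cluster name `talos` is only an example and the `.vault/` layout follows the conventions described above:
```bash
# Create the private .vault/ subdir (excluded from the repo via .gitignore).
mkdir -p .vault
chmod go-rwx .vault

# One password for the shared group_vars/all/ files and one per cluster.
tr -cd '[:alnum:]' < /dev/urandom | fold -w60 | head -n1 > .vault/vault_pass.txt.all
tr -cd '[:alnum:]' < /dev/urandom | fold -w60 | head -n1 > .vault/vault_pass.txt.talos

# Keep the password files private.
chmod -R go-rwx .vault/
```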
3. Configure Ansible settings including the vault.
To create a new virtual cluster you will need ```group_vars``` and an inventory for that HPC cluster:
To create a new virtual cluster you will need ```group_vars``` and a static inventory for that HPC cluster:
* See the ```*_hosts.ini``` files for existing clusters for examples to create a new ```[name-of-the-cluster]*_hosts.ini```.
* See the ```static_inventories/*_hosts.ini``` files of existing clusters for examples when creating a new ```[name-of-the-cluster]*_hosts.ini```.
* Create a ```group_vars/[name-of-the-cluster]/``` folder with a ```vars.yml```.
You'll find an example ```vars.yml``` file in ```group_vars/template/```.
To generate a new ```secrets.yml``` with new random passwords for the various daemons/components and encrypt this new ```secrets.yml``` file:
```bash
./generate_secrets.py group_vars/template/secrets.yml group_vars/[name-of-the-cluster]/secrets.yml
ansible-vault --vault-password-file=.vault_pass.txt.[name-of-the-cluster] encrypt group_vars/[name-of-the-cluster]/secrets.yml
ansible-vault --vault-id [name-of-the-cluster]@.vault/vault_pass.txt.[name-of-the-cluster] encrypt group_vars/[name-of-the-cluster]/secrets.yml
```
The encrypted ```secrets.yml``` can now safely be committed.
The ```.vault_pass.txt.[name-of-the-cluster]``` file is excluded from the repo using the ```.vault_pass.txt*``` pattern in ```.gitignore```.
The ```.vault/vault_pass.txt.[name-of-the-cluster]``` file is excluded from the repo using the ```.vault*``` pattern in ```.gitignore```.

To use an existing encrypted ```group_vars/[name-of-the-cluster]/secrets.yml```:

* Add a ```.vault_pass.txt.[name-of-the-cluster]``` file to the root folder of this repo and use a text editor to add the vault password to this file.
* Add a ```.vault/vault_pass.txt.[name-of-the-cluster]``` file to this repo and use a text editor to add the vault password to this file.
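To inspect or update an encrypted ```secrets.yml``` later on, the standard `ansible-vault view` and `ansible-vault edit` subcommands can be used with the same `--vault-id`; a brief sketch with the cluster name as a placeholder:
```bash
# Show the decrypted content without writing it to disk.
ansible-vault view --vault-id [name-of-the-cluster]@.vault/vault_pass.txt.[name-of-the-cluster] \
    group_vars/[name-of-the-cluster]/secrets.yml

# Edit the secrets; the file is re-encrypted automatically when the editor is closed.
ansible-vault edit --vault-id [name-of-the-cluster]@.vault/vault_pass.txt.[name-of-the-cluster] \
    group_vars/[name-of-the-cluster]/secrets.yml
```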

4. Configure the Certificate Authority (CA).

@@ -185,7 +187,7 @@ Deploying a fully functional virtual cluster from scratch involves the following
Execute:
```
dd if=/dev/urandom bs=1 count=1024 > roles/slurm-management/files/[name-of-the-cluster]_munge.key
ansible-vault --vault-password-file=.vault_pass.txt.[name-of-the-cluster] encrypt roles/slurm-management/files/[name-of-the-cluster]_munge.key
ansible-vault --vault-id [name-of-the-cluster]@.vault/vault_pass.txt.[name-of-the-cluster] encrypt roles/slurm-management/files/[name-of-the-cluster]_munge.key
```
The encrypted ```[name-of-the-cluster]_munge.key``` can now safely be committed.
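As a quick sanity check before committing, a vault-encrypted munge key starts with a plain-text `$ANSIBLE_VAULT;` header instead of raw binary data; a small sketch (the exact header fields depend on the ansible-vault format version and vault-id label):
```bash
# Expect something like "$ANSIBLE_VAULT;1.2;AES256;[name-of-the-cluster]" on the first line.
head -n1 roles/slurm-management/files/[name-of-the-cluster]_munge.key
```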
@@ -194,11 +196,17 @@ Deploying a fully functional virtual cluster from scratch involves the following
Some examples for the *Talos* development cluster:
* Configure the dynamic inventory and jumphost for the *Talos* test cluster:
```bash
export AI_INVENTORY='talos_hosts.ini'
export AI_INVENTORY='static_inventories/talos_hosts.ini'
export AI_PROXY='reception'
export ANSIBLE_VAULT_PASSWORD_FILE='.vault_pass.txt.talos'
export ANSIBLE_VAULT_IDENTITY_LIST='all@.vault/vault_pass.txt.all, talos@.vault/vault_pass.txt.talos'
```
This can also be accomplished with less typing by sourcing an initialisation file, which provides the ```lor-config``` function
to configure these environment variables for a specific cluster/site:
```
. ./lor-init
lor-config talos
```
* Firstly
* Firstly,
* Create local admin accounts, which can then be used to deploy the rest of the playbook.
* Deploy the signed host keys.
Without local admin accounts we'll need to use either a ```root``` account for direct login or the default user account of the image used to create the VMs.
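For this bootstrap step the header comments of `cluster.yml` (below) spell out the pattern: connect as the cloud image's default user with host key checking disabled and deploy only the admin users. A sketch taken from those comments, assuming the default user is `centos`:
```bash
# First run: no signed host keys and no local admin accounts exist yet.
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -i inventory.py -u centos -l 'jumphost,cluster' single_role_playbooks/admin-users.yml
```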
140 changes: 74 additions & 66 deletions cluster.yml
@@ -2,9 +2,8 @@
# Order of deployment required to prevent chicken-and-egg issues:
# 0. For all deployment phases:
# export AI_PROXY="${jumphost_name}"
# export AI_INVENTORY="${cluster_name}_hosts.ini"
# export AI_INVENTORY="static_inventories/${cluster_name}_hosts.ini"
# ANSIBLE_VAULT_PASSWORD_FILE=".vault_pass.txt.${cluster_name}"
#
# 1. Use the standard CentOS cloud image user 'centos' or the 'root' user, without host key checking:
# export ANSIBLE_HOST_KEY_CHECKING=False
# ansible-playbook -i inventory.py -u centos -l 'jumphost,cluster' single_role_playbooks/admin-users.yml
@@ -17,14 +16,29 @@
# ansible-playbook -i inventory.py -u [admin_account] cluster.yml
# This will configure:
# A. Jumphost first as it is required to access the other machines.
# B. SAI as it is required to
# * configure layout on shared storage devices used by other machines.
# * configure Slurm control and Slurm database.
# C. DAI
# D. UI
# E. Compute nodes
# F. Documentation server
# B. Basic roles for all cluster machines part 1:
# * Roles that do NOT require regular accounts or groups to be present.
# C. An LDAP with regular user accounts, which may be required for additional roles.
# (E.g. a chmod or chgrp for a file/folder requires the corresponding user or group to be present.)
# D. Basic roles for all cluster machines part 2:
# * Roles that DO depend on regular accounts and groups.
# E. SAI as it is required to:
# * Configure layout on shared storage devices used by other machines.
# * Configure Slurm control and Slurm database.
# F. DAI
# G. UI
# H. Compute nodes
# I. Documentation server
#

#
# Dummy play to ping jumphosts and establish a persistent SSH connection
# before trying to connect to the machines behind the jumphost,
# which may otherwise fail when SSH connection multiplexing is used.
#
- name: 'Dummy play to ping jumphosts and establish a persistent SSH connection.'
hosts: jumphost

- name: 'Sanity checks before we start.'
hosts: all
pre_tasks:
@@ -47,7 +61,7 @@
- sshd
- node_exporter
- {role: geerlingguy.security, become: true}
- prom_proxy
- {role: grafana_proxy, when: ansible_hostname == 'airlock'}
- regular-users
tasks:
- name: 'Install cron job to reboot jumphost regularly to activate kernel updates.'
@@ -61,106 +75,100 @@
cron_file: reboot
become: true

- name: 'B. Roles for SAIs.'
- name: 'B. Basic roles for all cluster machines part 1.'
hosts:
- sys-admin-interface
- cluster
roles:
- admin-users
- ssh_host_signer
- ssh_known_hosts
- spacewalk_client
- logins
- figlet_hostname
- mount-volume
- ldap
- figlet_motd
- node_exporter
- static-hostname-lookup
- cluster
- sshd
- resolver
- shared_storage
- coredumps

- name: 'C. Create LDAP account server.'
hosts:
- ldap-server
roles:
- role: openldap
when:
- use_ldap | default(true, true) | bool
- create_ldap | default(false, true) | bool

- name: 'D. Basic roles for all cluster machines part 2.'
hosts:
- cluster
roles:
- ldap # client
- sshd
- regular-users
- shared_storage

- hosts: slurm-management
- name: 'E. Roles for SAIs.'
hosts:
- sys-admin-interface
roles:
- mount-volume
- slurm-management
- prom_server
- {role: cadvisor, become: true}
- grafana
- cadvisor
vars:
# These variables are needed by the mariadb role,
# which is a dependency of the slurm-management role.
# See roles/slurm/meta/main.yml
hostname_node0: "{{ ansible_hostname }}"
ip_node0: "{{ ansible_default_ipv4['address'] }}"

- name: 'C. Roles for DAIs.'
- name: 'F. Roles for DAIs.'
hosts: deploy-admin-interface
roles:
- admin-users
- ssh_host_signer
- ssh_known_hosts
- spacewalk_client
- logins
- figlet_hostname
- mount-volume
- build-environment
- ldap
- node_exporter
- static-hostname-lookup
- cluster
- sshd
- resolver
- shared_storage
- regular-users
- envsync

- name: 'D. Roles for UIs.'
- name: 'G. Roles for UIs.'
hosts: user-interface
roles:
- admin-users
- ssh_host_signer
- ssh_known_hosts
- spacewalk_client
- logins
- figlet_hostname
- ldap
- node_exporter
- static-hostname-lookup
- cluster
- sshd
- resolver
- shared_storage
- build-environment
- slurm_exporter
- slurm-client
- regular-users
- sudoers
- subgroup_directories
- role: fuse-layer
when: fuse_mountpoint is defined and fuse_mountpoint | length >= 1

- name: 'E. Roles for compute nodes.'
- name: 'H. Roles for compute nodes.'
hosts: compute-vm
roles:
- admin-users
- ssh_host_signer
- ssh_known_hosts
- spacewalk_client
- logins
- figlet_hostname
- mount-volume
- build-environment
- ldap
- node_exporter
- static-hostname-lookup
- cluster
- sshd
- resolver
- shared_storage
- slurm-client
- regular-users

- name: 'F. Roles for documentation servers.'
- name: 'I. Roles for documentation servers.'
hosts:
- docs
roles:
- admin-users
- ssh_host_signer
- yum-repos
- {role: geerlingguy.repo-epel, become: true}
- sshd
- {role: geerlingguy.security, become: true}
- online_docs
tasks:
- name: 'Install cron job to reboot server regularly to activate kernel updates.'
cron:
name: 'Reboot to load new kernel.'
weekday: '1'
minute: '45'
hour: '11'
user: root
job: /bin/needs-restarting -r >/dev/null 2>&1 || /sbin/shutdown -r +60 "Restarting to apply updates..."
cron_file: reboot
become: true
...
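Taken together, the header comments of this playbook sketch a full deployment roughly as follows once local admin accounts exist; the jumphost name, cluster name and `[admin_account]` are placeholders:
```bash
# Select the jumphost, static inventory and vault password file for this cluster.
export AI_PROXY="${jumphost_name}"
export AI_INVENTORY="static_inventories/${cluster_name}_hosts.ini"
export ANSIBLE_VAULT_PASSWORD_FILE=".vault_pass.txt.${cluster_name}"

# Full run as the local admin account created during the bootstrap step.
ansible-playbook -i inventory.py -u [admin_account] cluster.yml
```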
2 changes: 1 addition & 1 deletion documentation/Gearshift_technical_design.md
@@ -196,7 +196,7 @@ Both Grafana and Prometheus server will run inside Docker containers on this VM.
| FileSystem Quota reporting | quota.bash <br/> https://github.com/molgenis/cluster-utils | DAI | GCC |
| SLURM cluster usage reporting | slurm_report.bash <br/> https://github.com/molgenis/cluster-utils | DAI | GCC |
| Account expiration and group memberships | COmanage | SURF SCZ COmanage | GCC |
| SLURM Job Profiling | SLURM plugin & grafana | SAI/DAI? | GCC |
| SLURM Job Profiling | SLURM plugin & grafana | SAI/DAI | GCC |

#### DNS

1 change: 0 additions & 1 deletion galaxy-requirements.yml
@@ -1,7 +1,6 @@
---
- src: geerlingguy.firewall
version: 2.4.0
- src: geerlingguy.postfix
- src: geerlingguy.repo-epel
- src: geerlingguy.security
...
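If these external roles were already installed before this change, re-running the Galaxy install refreshes them; the standard `--force` flag of `ansible-galaxy install` overwrites roles that are already present:
```bash
# Re-install the roles listed in galaxy-requirements.yml, overwriting existing copies.
ansible-galaxy install -r galaxy-requirements.yml --force
```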