Merge pull request #7 from rug-cit-hpc/develop
Reviewed develop -> master.
pneerincx authored Jan 11, 2019
2 parents b976ecb + d09d2d6 commit c2a8a21
Showing 187 changed files with 8,522 additions and 650 deletions.
12 changes: 11 additions & 1 deletion .gitignore
@@ -1,5 +1,15 @@
*.DS_Store
*.project
*.settings
*.md.html
*.vagrant
*.pydevproject
*.retry
*.swp
documentation/.~lock.UMCG Research IT HPC cluster technical design.docx#
.vault_pass.txt
documentation/.~lock.UMCG Research IT HPC cluster technical design.docx#
promtools/results/*
roles/hpc-cloud
roles/HPCplaybooks
roles/HPCplaybooks/*
ssh-host-ca
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

120 changes: 95 additions & 25 deletions README.md
@@ -1,26 +1,80 @@
# gearshift
# League of Robots

This repository contains playbooks and documentation for gcc's gearshift cluster.
## About this repo

## Git repository
All site specific configuration for the Gearshift cluster will be placed in this git repository.
This repository contains playbooks and documentation to deploy virtual Linux HPC clusters, which can be used as *collaborative, analytical sandboxes*.
All clusters were named after robots that appear in the animated sitcom [Futurama](https://en.wikipedia.org/wiki/Futurama).

## protected master.
The master branch is protected; updates will only be pushed to this branch after review.
#### Software/framework ingredients

## Ansible playbooks openstack cluster.
The main ingredients for (deploying) these clusters:
* [Ansible playbooks](https://github.com/ansible/ansible) for system configuration management.
* [OpenStack](https://www.openstack.org/) for virtualization. (Note that deploying OpenStack itself is not part of the configs/code in this repo.)
* [Spacewalk](https://spacewalkproject.github.io/index.html) to create freezes of Linux distros.
* [CentOS 7](https://www.centos.org/) as OS for the virtual machines.
* [Slurm](https://slurm.schedmd.com/) as workload/resource manager to orchestrate jobs.

#### Protected branches
The master and develop branches of this repo are protected; updates can only be merged into these branches using reviewed pull requests.

## Clusters

This repo currently contains code and configs for the following clusters:
* Gearshift: [UMCG](https://www.umcg.nl) Research IT cluster hosted by the [Center for Information Technology (CIT) at the University of Groningen](https://www.rug.nl/society-business/centre-for-information-technology/).
* Talos: Development cluster hosted by the [Center for Information Technology (CIT) at the University of Groningen](https://www.rug.nl/society-business/centre-for-information-technology/).
* Hyperchicken: [Solve-RD](https://solve-rd.eu/) cluster hosted by [The European Bioinformatics Institute (EMBL-EBI)](https://www.ebi.ac.uk/) in the [Embassy Cloud](https://www.embassycloud.org/).

Deployment and functional administration of all clusters is a joint effort of the
[Genomics Coordination Center (GCC)](http://wiki.gcc.rug.nl/)
and the
[Center for Information Technology (CIT)](https://www.rug.nl/society-business/centre-for-information-technology/)
from the [University Medical Center](https://www.umcg.nl) and [University](https://www.rug.nl) of Groningen.

#### Cluster components

The clusters are composed of the following types of machines:
* **Jumphost**: security-hardened machines for SSH access.
* **User Interface (UI)**: machines for job management by regular users.
* **Deploy Admin Interface (DAI)**: machines for deployment of bioinformatics software and reference datasets without root access.
* **Sys Admin Interface (SAI)**: machines for maintenance / management tasks that require root access.
* **Compute Node (CN)**: machines that crunch jobs submitted by users on a UI.
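
For example, regular users typically reach a UI via a jumphost. A minimal sketch using OpenSSH's `ProxyJump` option; the hostnames and account name below are placeholders, not names from this repo:

```bash
# Log in on a User Interface (UI) machine via the security-hardened jumphost.
# Hostnames and the account name are hypothetical examples.
ssh -J your_account@jumphost.example.org your_account@ui.example.org

# Equivalent ~/.ssh/config snippet, so a plain 'ssh ui.example.org' works:
#   Host ui.example.org
#       ProxyJump your_account@jumphost.example.org
```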

The clusters use the following types of storage systems / folders:

| Filesystem/Folder | Shared/Local | Backups | Mounted on | Purpose/Features |
| :-------------------------- | :----------: | :-----: | :------------------- | :--------------- |
| /home/${home}/ | Shared | Yes | UIs, DAIs, SAIs, CNs | Only for personal preferences: small data == tiny quota.|
| /groups/${group}/prm[0-9]/ | Shared | Yes | UIs, DAIs | **p**e**rm**anent storage folders: for rawdata or *final* results that need to be stored for the mid/long term. |
| /groups/${group}/tmp[0-9]/ | Shared | No | UIs, DAIs, CNs | **t**e**mp**orary storage folders: for staged rawdata and intermediate results on compute nodes that only need to be stored for the short term. |
| /groups/${group}/scr[0-9]/ | Local | No | Some UIs | **scr**atch storage folders: same as **tmp**, but local storage as opposed to shared storage. Optional and hence not available on all UIs. |
| /local/${slurm_job_id} | Local | No | CNs | Local storage on compute nodes only available during job execution. Hence folders are automatically created when a job starts and deleted when it finishes. |
| /mnt/${complete_filesystem} | Shared | Mixed | SAIs | Complete file systems, which may contain various `home`, `prm`, `tmp` or `scr` dirs. |
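
To illustrate how these folders are meant to be used together, a sketch of a typical data life cycle; the group name, folder numbers and file names below are made up:

```bash
# Hypothetical group and file names; the path layout follows the table above.
export GROUP=mygroup

# Stage raw data from backed-up prm storage onto shared, non-backed-up tmp storage:
cp "/groups/${GROUP}/prm01/rawdata/sample1.fastq.gz" "/groups/${GROUP}/tmp01/rawdata/"

# Inside a Slurm job script, work in the node-local scratch dir,
# which only exists for the lifetime of the job:
#   cp "/groups/${GROUP}/tmp01/rawdata/sample1.fastq.gz" "/local/${SLURM_JOB_ID}/"

# Copy *final* results back to prm storage for mid/long-term, backed-up retention:
cp "/groups/${GROUP}/tmp01/results/sample1.bam" "/groups/${GROUP}/prm01/projects/project1/"
```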

## Deployment phases

Deploying a fully functional virtual cluster involves the following steps:
1. Configure physical machines
2. Deploy OpenStack virtualization layer on physical machines to create an OpenStack cluster
3. Create and configure virtual machines on the OpenStack cluster to build an HPC cluster on top of it
4. Deploy bioinformatics software and reference datasets
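
A rough sketch of how phases 2 and 3 map onto playbook runs in this repo; the mapping of `cluster.yml` to phase 3 is an assumption based on its contents further below, and phase 4 typically happens afterwards from a Deploy Admin Interface:

```bash
# Phase 2: deploy the OpenStack virtualization layer on the physical machines.
ansible-playbook --vault-password-file=.vault_pass.txt site.yml

# Phase 3 (assumed): create and configure the virtual cluster components.
ansible-playbook --vault-password-file=.vault_pass.txt cluster.yml
```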

---

### 2. Ansible playbooks for the OpenStack cluster
The Ansible playbooks in this repository use roles from the [hpc-cloud](https://git.webhosting.rug.nl/HPC/hpc-cloud) repository.
The roles are imported here explicitly by Ansible using Ansible Galaxy.
These roles install various Docker images built and hosted by RuG webhosting. They are built from separate git repositories on https://git.webhosting.rug.nl.
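
As a quick check (not prescribed by this repo), one can list which roles ended up in the local roles directory after the install step below:

```bash
# List locally installed roles; adjust -p to wherever the roles were installed.
ansible-galaxy list -p roles
```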

## Deployment of openstack.
#### Deployment of OpenStack
The steps below describe how to get from machines with a bare Ubuntu 16.04 installation to a running OpenStack installation.

---

1. First inport the HPC openstack roles into this playbook:

1. First import the required roles into this playbook:
```bash
ansible-galaxy install -r requirements.yml --force -p roles
ansible-galaxy install -r galaxy-requirements.yml
```

2. Generate an Ansible vault password and put it in `.vault_pass.txt`. This can be done by running the following one-liner:
@@ -29,22 +83,38 @@ The steps below describe how to get from machines with a bare ubuntu 16.04 insta
tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1 > .vault_pass.txt
```

3. generate and encrypt the passwords for the various openstack components.

```bash
./generate_secrets.py
ansible-vault --vault-password-file=.vault_pass.txt encrypt secrets.yml
```
the secrets.yml can now safel be comitted. the `.vault_pass.txt` file is in the .gitignore and needs to be tranfered in a secure way.

4. Install the openstack cluster.

```bash
ansible-playbook --vault-password-file=.vault_pass.txt site.yml
```
3. Configure Ansible settings including the vault.
* To create (a new) secrets.yml:
Generate and encrypt the passwords for the various OpenStack components.
```bash
./generate_secrets.py
ansible-vault --vault-password-file=.vault_pass.txt encrypt secrets.yml
```
The encrypted secrets.yml can now safely be committed.
The `.vault_pass.txt` file is in the .gitignore and needs to be transferred in a secure way.

* To use an existing encrypted secrets.yml, add .vault_pass.txt to the root folder of this repo
and create an ansible.cfg in the same location using the following template:
```
[defaults]
inventory = hosts
stdout_callback = debug
forks = 20
vault_password_file = .vault_pass.txt
remote_user = your_local_account_not_from_the_LDAP
```

4. Running playbooks. Some examples:
* Install the OpenStack cluster.
```bash
ansible-playbook site.yml
```
* Deploy only the Slurm part on the test cluster *Talos*:
```bash
ansible-playbook -i talos_hosts slurm.yml
```

5. Verify operation.
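
Step 5 is deliberately terse; a minimal sketch of checks one could run (the inventory name is taken from the Talos example above, the rest are stock Ansible and Slurm commands):

```bash
# Ansible can reach every machine in the inventory:
ansible -i talos_hosts all -m ping

# On a User Interface machine: the scheduler sees its partitions
# and can run a trivial job.
sinfo
srun --nodes=1 --ntasks=1 hostname
```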

# Steps to upgrade openstack cluster.
#### Steps to upgrade the OpenStack cluster

# Steps to install Compute cluster on top of openstack cluster.
### 3. Steps to install the compute cluster on top of the OpenStack cluster
3 changes: 2 additions & 1 deletion ansible.cfg
@@ -1,2 +1,3 @@
[defaults]
inventory = hosts
stdout_callback = debug
vault_password_file = .vault_pass.txt
78 changes: 78 additions & 0 deletions cluster.yml
@@ -0,0 +1,78 @@
---
- name: Install roles needed for all virtual cluster components except jumphosts.
hosts: cluster
become: true
tasks:
roles:
- spacewalk_client
- ldap
- node_exporter
- cluster

- name: Install roles needed for jumphosts.
hosts: jumphost
become: true
roles:
- ldap
- cluster
- geerlingguy.security
tasks:
- cron:
name: Reboot to load new kernel.
weekday: 1
minute: 45
hour: 11
user: root
job: /bin/needs-restarting -r >/dev/null 2>&1 || /sbin/shutdown -r +60 "restarting to apply updates"
cron_file: reboot

- hosts: slurm
become: true
roles:
- prom_server
- cadvisor
- slurm

- name: Install virtual compute nodes
hosts: compute-vm
become: true
tasks:
roles:
- compute-vm
- isilon
- datahandling
- slurm-client

- name: Install user interface
hosts: interface
become: true
tasks:
roles:
- slurm_exporter
- user-interface
- datahandling
- isilon
- slurm-client

- name: Install ansible on admin interfaces (DAI & SAI).
hosts:
- imperator
- sugarsnax
become: True
tasks:
- name: install Ansible
yum:
name: ansible-2.6.6-1.el7.umcg

- name: export /home
hosts: user-interface:&talos-cluster
roles:
- nfs_home_server

- name: Mount /home.
hosts: compute-vm:&talos-cluster
roles:
- nfs_home_client

- import_playbook: users.yml
#- import_playbook: ssh-host-signer.yml
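
A hedged example of how a playbook like `cluster.yml` might be run; the inventory file name is taken from the Talos example in the README above, `--check`/`--diff`/`--limit` are standard Ansible options, and the `jumphost` group name comes from the play definitions above:

```bash
# Dry-run first to see what would change, then apply:
ansible-playbook -i talos_hosts --check --diff cluster.yml
ansible-playbook -i talos_hosts cluster.yml

# Or restrict a run to a single group of machines, e.g. only the jumphosts:
ansible-playbook -i talos_hosts --limit jumphost cluster.yml
```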
86 changes: 86 additions & 0 deletions dai.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
---
- hosts: deploy-admin-interface
become: true
tasks:
- name: Install OS dependencies (with yum).
yum:
state: latest
update_cache: yes
name:
#
# 'Development tools' package group and other common deps.
#
- "@Development tools"
- libselinux-devel
- kernel-devel
- gcc-c++
#
# Slurm dependencies.
#
- readline-devel
- pkgconfig
- perl-ExtUtils-MakeMaker
- perl
- pam-devel
- openssl-devel
- numactl-devel
- nss-softokn-freebl
- ncurses-devel
- mysql-devel
- munge-libs
- munge-devel
- mariadb-devel
- man2html
- lua-devel
- hwloc-devel
- hdf5-devel
- blcr-devel
- blcr
#
# Ansible dependencies.
#
- python2-devel
- python-nose
- python-coverage
- python-mock
- python-boto3
- python-botocore
- python-passlib
- python2-sphinx-theme-alabaster
- pytest
#
# Lua, Lmod, EasyBuild dependencies.
#
- rdma-core-devel
- libxml2-devel

- name: Set lustre client source url.
set_fact:
lustre_rpm_url: https://downloads.whamcloud.com/public/lustre/lustre-2.10.4/el7/client/SRPMS
lustre_src_rpm_name: lustre-2.10.4-1.src.rpm
lustre_client_rpm_name: lustre-client-2.10.4-1.el7.x86_64.rpm

- name: check if the buildserver has already built the client.
stat:
path: /root/rpmbuild/RPMS/x86_64/{{ lustre_client_rpm_name }}
register: remote_file

- name: build the lustre client.
block:
- name: Fetch the lustre client source
get_url:
url: "{{ lustre_rpm_url }}/{{ lustre_src_rpm_name }}"
dest: /tmp/{{ lustre_src_rpm_name }}

- name: build the lustre client.
command: rpmbuild --rebuild --without servers /tmp/{{ lustre_src_rpm_name }}
become: true
when: remote_file.stat.exists == false

- name: Mount isilon apps
mount:
path: /apps
src: gcc-storage001.stor.hpc.local:/ifs/rekencluster/umcgst10/.envsync/tmp01
fstype: nfs
opts: defaults,_netdev,nolock,vers=4.0,noatime,nodiratime
state: present
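
For reference, a rough shell equivalent of the Lustre client build tasks above, using the same URL and version facts; the final install step is an assumption, since the play itself only builds the RPM:

```bash
# Fetch the Lustre 2.10.4 source RPM and rebuild only the client packages,
# mirroring the get_url and rpmbuild tasks above.
curl -L -o /tmp/lustre-2.10.4-1.src.rpm \
  https://downloads.whamcloud.com/public/lustre/lustre-2.10.4/el7/client/SRPMS/lustre-2.10.4-1.src.rpm
rpmbuild --rebuild --without servers /tmp/lustre-2.10.4-1.src.rpm

# The rebuilt client RPM lands under /root/rpmbuild/RPMS/x86_64/;
# installing it (not part of this play) would look roughly like:
yum install -y /root/rpmbuild/RPMS/x86_64/lustre-client-2.10.4-1.el7.x86_64.rpm
```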
