New Production with Multiple Instances and vGPU
As of July 2024, Compute Canada suggests that we migrate to vGPU instances, and the old hardware for GPU instances will reach end of service soon. This page includes all the steps to set up (or reproduce) the current Rodan production server(s). Here is a summary of what we have now for rodan2.simssa.ca. Some reasoning behind this choice can be found in issue #1184.
- A manager instance on Ubuntu 20.04 with Docker version 24.0.2, build cb74dfc, with 8 vCPUs and 1 vGPU (driver 550, 16GiB GPU RAM) and 40 GiB instance RAM.
- A worker instance on Ubuntu 20.04 with Docker version 24.0.2, build cb74dfc, with 16 vCPUs and 16 GiB instance RAM.
Note: do not upgrade Docker to any newer version unless we are sure that a later Docker Engine does not lead to DNS resolution issues on Ubuntu.
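To guard against an accidental upgrade during routine apt maintenance, the Docker packages can be put on hold. This is a minimal sketch assuming Docker was installed from Docker's apt repository (package names docker-ce, docker-ce-cli, containerd.io); adjust the package names if it was installed another way.

```bash
# Hold the Docker packages so a routine `apt upgrade` does not pull in a newer Engine.
sudo apt-mark hold docker-ce docker-ce-cli containerd.io
# Confirm the hold is in place.
apt-mark showhold
```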
We distribute the containers across the two instances as follows, using Docker swarm.
On manager instance:
- rodan_rodan-main
- rodan_celery
- rodan_gpu-celery
- rodan_nginx
- rodan_postgres
- rodan_py3-celery
- rodan_redis
On worker instance:
- rodan_rabbitmq
- rodan_iipsrv
- rodan_rodan-client
Ideally, we want to put py3-celery on the worker instance at least. Although this is possible (and tested) with Debian 11 and 12, on Ubuntu 20.04 we have to keep all of those containers on the same instance to avoid a redis timeout issue. Given the current limit of 8 vCPUs on the manager instance, performance would improve greatly if we could fix this and move those containers to the worker instance.
At this point, our manager instance boots from the old prod_Rodan2_GPU disk with all the user data and resources, so it is best practice to put postgres on this manager instance as well. The two instances share the data via NFS.
Also, upon testing, the p instance type (the worker instance) can easily be resized while retaining the same IP and Docker network.
We experienced a major server crash: the GPU driver mysteriously disappeared, and the Docker service consumed so much memory that it could neither be launched nor modified. Despite trying everything we could to rescue the server, nothing worked, and the instance continued to report out-of-memory kills for any process we attempted to run. In the end, we realized that the only solution was to deploy a new server.
However, new problems arose: while we could accomplish everything with Debian 11, we couldn't run PACO training using the GPU. On the other hand, when using Ubuntu 20.04, we were unable to deploy the Docker service.
Later, we discovered the root of the problem preventing us from launching a full Docker Swarm. When launching a new Arbutus instance with Ubuntu 20.04, the default Linux kernel is a KVM version (which you can verify by running uname -r). This kernel is compact and optimized for virtual machines, but it does not include IPVS, which is necessary for virtual IP services. To use IPVS, a generic Linux kernel is required, or we have to compile our own kernel.
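To check which kernel flavour an instance is running and whether IPVS is available, a quick check like the following can be used (the exact kernel version strings will differ):

```bash
# A "-kvm" suffix means the compact KVM kernel without IPVS; "-generic" includes it.
uname -r
# Try to load the IPVS module that Docker swarm's routing mesh relies on.
sudo modprobe ip_vs && lsmod | grep ip_vs || echo "IPVS is not available in this kernel"
```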
While it is possible to directly install a new kernel and boot into it (with some complicated steps), doing so would cause another issue— the inability to properly use the NVIDIA GPU driver that comes with the vGPU instance.
To resolve this, the best approach is to start with the old Rodan volume that uses the old generic Linux kernel (or create a volume from a snapshot), boot it in another cloud environment (such as a persistent p-flavor instance), upgrade to the desired Ubuntu version (currently 20.04), then delete the instance and reboot it as a vGPU instance. Now, if you SSH into this new instance and check the kernel, it will be the desired generic version. Installing the vGPU driver at this point will also install the necessary KVM kernel, thereby avoiding compatibility issues between the generic kernel and the vGPU driver, while keeping the default kernel as the generic version that includes IPVS.
Since this process is quite complex, we've saved multiple snapshots at each step for backup purposes.
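For reference, the delete-and-reboot-from-volume steps can also be done with the OpenStack CLI instead of the Horizon web UI described below. This is only a rough sketch; the server names and volume ID are placeholders, and the flavor, network, and security group are the ones listed in the next section.

```bash
# Delete the temporary persistent-cloud instance; the boot volume survives because
# "Delete Volume on Instance Delete" is set to False.
openstack server delete rodan-temp-upgrade

# Re-create the server from the same (now upgraded) volume as a vGPU instance.
openstack server create \
  --flavor g1-16gb-c8-40gb \
  --volume <boot-volume-id> \
  --network rpp-ichiro-network \
  --security-group prod-internal \
  rodan2-manager
```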
Go to the Arbutus OpenStack page and click Launch Instance. Here is the information to fill out in the form.
- Details: Any reasonable name and description. Make sure Availability Zone is Any.
- Source: For manager, it is boot from volume (and therefore the OS depends on the volume). For worker, it is boot from image; we pick the same OS (Ubuntu 20.04 in this case) and create a volume (1500 or 2000 GiB is fine). Make sure Delete Volume on Instance Delete is False for both worker and manager.
- Flavor: As of July 2024 we use g1-16gb-c8-40gb for manager and p16-16gb for worker.
- Networks: Select rpp-ichiro-network.
- Security Groups: Deselect default and select prod-internal.
- Configuration: Upload cloud.init from the ansible repo.
- Metadata: Add the "rodan" label so that the new instance can be automatically added to the os_service_rodan group managed by ansible. (This can also be done later.) Don't do anything else.
After this, run ansible useradd and adminadd to be able to SSH into the new instance.
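The playbook names and paths below are hypothetical (check the ansible repo for the actual ones); the invocation would look roughly like:

```bash
# Hypothetical playbook paths -- use the actual names from the ansible repo.
ansible-playbook playbooks/useradd.yml --limit os_service_rodan
ansible-playbook playbooks/adminadd.yml --limit os_service_rodan
```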
- Remove any existing Nvidia drivers.
sudo apt-get purge "*nvidia*"
- Follow the official guide from Compute Canada here according to the OS version.
- Install nvidia-container-toolkit. (Official website here)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
- Install container runtime.
sudo apt install nvidia-container-runtime
- Make sure to follow the Docker guide for the specific OS and install the exact version we want.
- Set up the nvidia runtime for Docker following the guide here. Prerequisites: (1) NVIDIA Container Toolkit; (2) Docker.
Steps:
a. sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
b. sudo systemctl restart docker
c. run docker info and verify that the Docker runtimes include nvidia:
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
Warning: steps from here are based on practice as there's no related official guide.
d. in /etc/docker/daemon.json, make sure it has the full path, like:
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"args": []
}
},
"default-runtime": "nvidia"
}
e. restart daemon and docker
systemctl daemon-reload
systemctl restart docker
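Before moving on, it is worth confirming that both the driver and the nvidia runtime work. A quick check (the CUDA image tag is only an example):

```bash
# Driver check on the host.
nvidia-smi
# Runtime check inside a container; it should print the same GPU table.
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
```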
- Generate the key pair.
ssh-keygen -t rsa -b 4096 -C "[email protected]"
We can name it rodan-docker.
- Enter the public key (~/.ssh/rodan-docker.pub) in the GitHub repo settings under deploy keys, with a name associated with the server. Make sure Allow write access is off.
- Create a config file in the ssh folder ~/.ssh/.
Host github.com
HostName github.com
User git
IdentityFile ~/.ssh/rodan-docker
- Test ssh.
ssh github.com
It should return
PTY allocation request failed on channel 0
Hi DDMAL/rodan-docker! You've successfully authenticated, but GitHub does not provide shell access.
Connection to github.com closed.
- Clone the Rodan repo.
cd /srv/webapps/
git clone --single-branch -b master git@github.com:DDMAL/Rodan.git
Make sure to double check the branch.
- Modify scripts/production.env to have all the credentials.
- Modify rodan-client/config/configuration.json to use the 443 port and set SERVER_HTTPS to true:
"SERVER_HOST": "rodan2.simssa.ca",
"SERVER_PORT": "443",
"SERVER_HTTPS": true,
- Adjust production.yml and make sure we have reasonable resource allocation for each container. Here is the one that works for the current specific settings.
version: "3.4"
services:
nginx:
image: "ddmal/nginx:v3.0.0"
deploy:
replicas: 1
resources:
reservations:
cpus: "0.25"
memory: 0.5G
limits:
cpus: "0.25"
memory: 0.5G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == manager
healthcheck:
test: ["CMD", "/usr/sbin/service", "nginx", "status"]
interval: "30s"
timeout: "10s"
retries: 10
start_period: "5m"
command: /run/start
environment:
TZ: America/Toronto
SERVER_HOST: rodan2.simssa.ca
TLS: 1
ports:
- "80:80"
- "443:443"
- "5671:5671"
- "9002:9002"
volumes:
- "resources:/rodan/data"
rodan-main:
image: "ddmal/rodan-main:v3.0.0"
deploy:
replicas: 1
resources:
reservations:
cpus: "0.8"
memory: 3G
limits:
cpus: "0.8"
memory: 3G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == manager
healthcheck:
test: ["CMD-SHELL", "/usr/bin/curl -H 'User-Agent: docker-healthcheck' http://localhost:8000/api/?format=json || exit 1"]
interval: "30s"
timeout: "30s"
retries: 5
start_period: "2m"
command: /run/start
environment:
TZ: America/Toronto
SERVER_HOST: rodan2.simssa.ca
CELERY_JOB_QUEUE: None
env_file:
- ./scripts/production.env
volumes:
- "resources:/rodan/data"
rodan-client:
image: "ddmal/rodan-client:nightly"
deploy:
placement:
constraints:
- node.role == worker
volumes:
- "./rodan-client/config/configuration.json:/client/configuration.json"
iipsrv:
image: "ddmal/iipsrv:nightly"
volumes:
- "resources:/rodan/data"
celery:
image: "ddmal/rodan-main:v3.0.0"
deploy:
replicas: 1
resources:
reservations:
cpus: "0.8"
memory: 4G
limits:
cpus: "0.8"
memory: 4G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == manager
healthcheck:
test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@celery", "-t", "30"]
interval: "30s"
timeout: "30s"
start_period: "1m"
retries: 5
command: /run/start-celery
environment:
TZ: America/Toronto
SERVER_HOST: rodan2.simssa.ca
CELERY_JOB_QUEUE: celery
env_file:
- ./scripts/production.env
volumes:
- "resources:/rodan/data"
py3-celery:
image: "ddmal/rodan-python3-celery:v3.0.0"
deploy:
replicas: 1
resources:
reservations:
cpus: "3"
memory: 6G
limits:
cpus: "3"
memory: 6G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == manager
healthcheck:
test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@Python3", "-t", "30"]
interval: "30s"
timeout: "30s"
retries: 5
command: /run/start-celery
environment:
TZ: America/Toronto
SERVER_HOST: rodan2.simssa.ca
CELERY_JOB_QUEUE: Python3
env_file:
- ./scripts/production.env
volumes:
- "resources:/rodan/data"
gpu-celery:
image: "ddmal/rodan-gpu-celery:v3.0.0"
deploy:
replicas: 1
resources:
reservations:
cpus: "1"
memory: 18G
limits:
cpus: "1"
memory: 18G
placement:
constraints:
- node.role == manager
restart_policy:
condition: any
delay: 5s
window: 30s
healthcheck:
test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
interval: "30s"
timeout: "30s"
retries: 5
command: /run/start-celery
environment:
TZ: America/Toronto
SERVER_HOST: rodan2.simssa.ca
CELERY_JOB_QUEUE: GPU
env_file:
- ./scripts/production.env
volumes:
- "resources:/rodan/data"
redis:
image: "redis:alpine"
deploy:
replicas: 1
resources:
reservations:
cpus: "1"
memory: 2G
limits:
cpus: "1"
memory: 2G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == manager
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
environment:
TZ: America/Toronto
postgres:
image: "ddmal/postgres-plpython:v3.0.0"
deploy:
replicas: 1
endpoint_mode: dnsrr
resources:
reservations:
cpus: "1"
memory: 2G
limits:
cpus: "1"
memory: 2G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == manager
healthcheck:
test: ["CMD-SHELL", "pg_isready", "-U", "postgres"]
interval: 10s
timeout: 5s
retries: 5
environment:
TZ: America/Toronto
volumes:
- "pg_data:/var/lib/postgresql/data"
- "pg_backup:/backups"
env_file:
- ./scripts/production.env
rabbitmq:
image: "rabbitmq:alpine"
deploy:
replicas: 1
resources:
reservations:
cpus: "3"
memory: 4G
limits:
cpus: "3"
memory: 4G
restart_policy:
condition: any
delay: 5s
window: 30s
placement:
constraints:
- node.role == worker
healthcheck:
test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
interval: "30s"
timeout: "3s"
retries: 3
environment:
TZ: America/Toronto
env_file:
- ./scripts/production.env
volumes:
resources:
pg_backup:
pg_data:
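Before deploying, the edited file can be sanity-checked for syntax errors. A minimal check, assuming the compose plugin (or standalone docker-compose) is available; this only validates and renders the file, it does not deploy anything:

```bash
# Parse and resolve production.yml together with its env_file; fails loudly on syntax errors.
docker compose -f production.yml config > /dev/null && echo "production.yml parses cleanly"
```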
NFS allows both instances to access the same directory and sync. Steps below are mostly from this guide. Also check this Ubuntu guide.
- Install NFS packages. On manager (or the instance with the Rodan data)
sudo apt update
sudo apt install nfs-kernel-server
On worker
sudo apt update
sudo apt install nfs-common
- Change ownership of directory on manager.
sudo chown nobody:nogroup /var/lib/docker/volumes/
- Configure NFS exports on manager.
sudo nano /etc/exports
and add this line: /var/lib/docker/volumes [IP address for worker instance, usually 192.168.xx.xx](rw,sync,no_subtree_check,no_root_squash)
- Restart NFS on manager.
sudo systemctl restart nfs-kernel-server
- Mount NFS on worker.
Make sure you purge everything in /var/lib/docker/volumes/ so that it is an empty directory.
sudo mount [manager IP]:/var/lib/docker/volumes/ /var/lib/docker/volumes/
Check if it has been correctly connected by df -h.
- (Optional but recommended) Mount the NFS directory at boot on worker.
sudo nano /etc/fstab
and add this line: [manager IP]:/var/lib/docker/volumes /var/lib/docker/volumes nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
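After these steps, the export and the mount can be verified from both sides with standard NFS tooling (paths and IPs as configured above):

```bash
# On manager: list the active exports.
sudo exportfs -v
# On worker: confirm the manager exports the directory, then test the fstab entry.
showmount -e [manager IP]
sudo umount /var/lib/docker/volumes && sudo mount -a   # do this before Docker is running, or the mount may be busy
df -h /var/lib/docker/volumes
```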
This is usually done by the ansible /playbooks/nginxconf.yml after the current manager (or the instance running Nginx) IP has been updated in /playbooks/vars/simssa.ca.yml under the rodan2 block.
(with sudo -i on both instances)
- On manager, run docker swarm init and you will see a command containing the worker join token. If swarm is already running, run docker swarm join-token worker instead.
- On worker, run the command generated in the previous step.
- On manager, verify there are two nodes by docker node ls.
- Start Rodan:
make pull_prod
make deploy_production
- Verify the Rodan service is correctly running by docker service ls on manager and docker ps -a on both instances. Sometimes rodan_main will fail when the stack is just launched, but docker swarm will successfully reproduce it later when the other containers are ready.
- Some debugging commands that might be helpful:
docker info
docker service logs [service id]
docker service ps [service id] --no-trunc
docker logs [container id]
docker exec -it [container id] [bash or sh]
- Some useful commands to run from /srv/webapps/Rodan on the instance that runs the corresponding container; these can be found in the Makefile:
make gpu-celery_log
make py3-celery_log
make celery_log
make rodan-main_log
We might consider hosting data on a separate instance so that we do not have to stick with Ubuntu 20.04 and fit all big containers in the manager instance.
Also, to upgrade the OS, if the nova cloud (for all GPU-related instances) repo does not provide an upgrade option, it is possible to delete the instance, boot the same volume as a regular p-flavor instance, and do the OS upgrade in the persistent cloud. After the volume has been upgraded to the desired newer OS version, we can delete the instance and boot a new vGPU instance from the same volume.
Be sure to search old issues and PRs for more notes.
We have not implemented the auto upgrade but instructions are here.