This project offers a Kubeflow distribution that has the following characteristics:
- It follows a fully declarative, GitOps approach using ArgoCD. No other middleware is injected. All manifests are defined either as vanilla Kubernetes YAML specs, Kustomize specs, or Helm charts.
- Maximum integration with AWS managed services. We offload as much as possible to AWS, including database and artifact storage, identity management, load balancing, network routing and more! See below for a full listing of the currently supported AWS managed services.
- A very simple init script and accompanying config file. We have intentionally kept this a simple "find-and-replace" script (rather than using a stricter approach, such as encoding the entire distribution as a Helm chart) in order to make the repo easy to extend.
- Currently we offer "Kubeflow 1.3", but with a few differences. We do not in all places follow the official Kubeflow manifests, preferring instead to follow directly the (often much more recent) upstream distributions for both the core Kubeflow components (such as Pipelines, KFServing etc.) and free-standing applications (such as Istio, Certificate Manager, KNative etc.). As of the time of writing this, the difference is small, but over time (and until Kubeflow cuts its 1.4 release) it will become more significant.
- One particular area where we have chosen a fundamentally different approach relates to authentication and authorization. We have replaced the oidc-authservice entirely, preferring instead to use oauth2-proxy due to its wide adoption and active user base.
- Lastly, our interpretation of Kubeflow is that of an open and configurable ecosystem that can be easily extended with other services. As such, we also offer optional integrations with applications that are not part of the official Kubeflow distribution (such as MLFlow).
This distribution assumes that you will be making use of the following AWS services:
- An EKS Kubernetes cluster
- Autoscaling Groups as Worker Nodes in the EKS cluster. We use the cluster-autoscaler application to automatically scale nodes up or down depending on usage.
- An RDS instance, with security group and VPC configuration that allows it to be accessed from the Worker Nodes in the EKS cluster. We authenticate to the RDS database using classical username / password credentials.
- S3 Bucket(s) for Pipeline and (optionally) MLFlow artifact storage.
- An ElastiCache Redis instance for storing cookies during the OIDC authentication process.
- A Network Load Balancer through which external ingress/egress is facilitated. We use the aws-load-balancer-controller application in order to automatically provision NLBs in the correct subnets.
- Route53 for DNS routing. We use the external-dns application to automatically create record sets in Route53 in order to route from a public DNS to the NLB endpoint, as well as a LetsEncrypt DNS-01 solver to certify the domain with Route53.
- AWS Secrets Manager for storing sensitive data, such as various types of credentials. We use the external-secrets application to fetch these secrets into the Kubernetes cluster, allowing us to define in Git only the location where the secrets are to be found, as well as the ServiceAccount to use in order to find them.
- IAM Roles for Service Accounts (IRSA) to define the IAM Roles that may be assumed by specific Pods, by attaching a specific ServiceAccount to them. For example, we attach to the external-dns Pod a ServiceAccount that uses an IAM Role allowing certain actions in Route53. See the section below for a detailed listing of IRSA policies that are needed.
- IAM Users. As far as possible, we try to avoid relying on IAM Users with static credentials, but there are certain cases where IRSA is currently not supported by the underlying Kubeflow applications. This includes Kubeflow Pipelines (for S3 artifact storage) and KFServing (for serving models directly from S3).
In the future we may develop overlays that would make some of these services optional, but for the current release if you wish to take them out this needs to be done after forking the repo.
Below you will find all of the IAM Policies that need to be attached to the IRSA roles. Before looking at the policies though, please take note of the fact that IRSA works via setting up a Trust relationship to a specific ServiceAccount in a specific Namespace. If you find that an IAM role is not being correctly assumed, it probably means that you are attaching it to a ServiceAccount that hasn't explicitly been authorized to do so.
Let's take the external-dns service as an example. The ServiceAccount for this application is defined here, is named external-dns and is rolled out in the kube-system Namespace. To allow this ServiceAccount to assume an IAM Role, we have to set a trust relationship that looks as follows:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890:sub": "system:serviceaccount:kube-system:external-dns"
        }
      }
    }
  ]
}
For every IRSA Role you set up, you will need a trust relationship such as the one above, substituting the actual OIDC provider URL and replacing "kube-system" and "external-dns" in system:serviceaccount:kube-system:external-dns with the appropriate Namespace and ServiceAccount names respectively.
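As a rough illustration of how the trust relationship varies from role to role, the following Python sketch builds the same document for an arbitrary Namespace and ServiceAccount. The helper name `irsa_trust_policy` and its parameters are hypothetical and not part of this repository; the account ID and OIDC provider are the example values used above.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build an IRSA trust policy for one ServiceAccount (illustrative helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    # The cluster's OIDC provider, registered in IAM.
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Only this Namespace/ServiceAccount pair may assume the role.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}"
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        "123456789012",
        "oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890",
        "kube-system",
        "external-dns",
    )
    print(json.dumps(policy, indent=2))
```

Only the last two arguments change between roles, which is why a mismatch in either the Namespace or the ServiceAccount name is the usual reason a role is not assumed.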
Further down in this guide we explain how to initialise this repository. For now, just take note that we use placeholder values such as <<__role_arn.external_dns__>> that will be replaced by the actual ARNs of the roles you wish to use. Below is a listing of all of the IRSA roles in use in this repository, along with links to JSON files with example policies. If you search the whole "distribution" folder, you will find exactly where these placeholders are used.
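One way to run such a search is sketched below in Python. It is a minimal example, not a tool shipped with this repository: the function name `find_placeholders` is hypothetical, and the regular expression assumes that placeholders follow the <<__...__>> naming convention and contain only letters, digits, dots, underscores and hyphens.

```python
import re
from pathlib import Path

# Matches tokens such as <<__role_arn.external_dns__>> (assumed character set).
PLACEHOLDER = re.compile(r"<<__[A-Za-z0-9_.\-]+__>>")


def find_placeholders(root):
    """Return {placeholder: sorted list of files under root that contain it}."""
    hits = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for token in set(PLACEHOLDER.findall(text)):
            hits.setdefault(token, []).append(str(path))
    return {token: sorted(files) for token, files in hits.items()}


if __name__ == "__main__":
    for token, files in sorted(find_placeholders("distribution").items()):
        print(token)
        for name in files:
            print("   ", name)
```

Running the same search after initialising the repository is also a quick way to spot placeholders that were never given a value.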
The aws-load-balancer-controller needs policies that allow it to provision an NLB in specific subnets.
- Placeholder: <<__role_arn.aws_load_balancer_controller__>>
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_kube-system_aws-load-balancer-controller
- Policy: link
The cluster-autoscaler needs policies that allow it to automatically scale EC2 instances up/down.
- Placeholder: <<__role_arn.cluster_autoscaler__>>
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_kube-system_aws-cluster-autoscaler
- Policy: link
external-dns needs policies that allow it to automatically create record sets in Route53.
- Placeholder: <<__role_arn.external_dns__>>
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_kube-system_external-dns
- Policy: link
cert-manager needs policies that allow it to automatically create entries in Route53 in order to allow for DNS-01 solving.
- Placeholder: <<__role_arn.cert_manager__>>
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_cert-manager_cert-manager
- Policy: link
The external-secrets application is a middleman that will create ExternalSecret custom resources in specific namespaces. It can be configured in two ways.
Option 1: Allow the external-secrets application broad authority to read and write AWS secrets.
Option 2: Allow the external-secrets application to assume roles that have more narrowly defined permissions.
- Placeholder: <<__role_arn.external_secrets>>
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_kube-system_external_secrets
- Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/my-cluster_kube-system_external-secrets"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
In addition, we need to grant each role limited access to secrets. We have chosen an approach of limiting access to secrets by namespace, but it is possible to make this more granular if desired.
- Placeholder: <<__role_arn.external_secrets.argocd__>>=
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_argocd
- Policy: link
- Placeholder: <<__role_arn.external_secrets.kubeflow__>>=
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_kubeflow
- Policy: link
- Placeholder: <<__role_arn.external_secrets.mlflow__>>=
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_mlflow
- Policy: link
- Placeholder: <<__role_arn.external_secrets.auth__>>=
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_auth
- Policy: link
- Placeholder: <<__role_arn.external_secrets.auth__>>=
- Example ARN: arn:aws:iam::123456789012:role/my-cluster_istio-system
- Policy: link
There are two supported AWS backend types:
- Secrets Manager is the default type set in setup.conf: <<__external_secrets.backend_type__>>=secretsManager
- Systems Manager can be used instead by updating the following placeholder value in setup.conf to systemManager: <<__external_secrets.backend_type__>>=systemManager
Unfortunately at the moment it is not possible to use IRSA in conjunction with Kubeflow Pipelines, which currently uses both the Minio Go and JavaScript clients. On both of those, additional work is needed to enable IRSA. Please see this tracking issue: kubeflow/pipelines#3405
For now, we use an IAM User in order to facilitate writing Pipeline artifacts to S3. The user's credentials are fetched from AWS Secrets Manager using an ExternalSecret. The relevant details for the IAM User are as follows:
- IAM AccessKey Placeholder: <<__external_secret_name.kubeflow.s3_accesskey__>>
- IAM SecretKey Placeholder: <<__external_secret_name.kubeflow.s3_secretkey__>>
- Policy: link
This repository contains Kustomize manifests that point to the upstream manifests of each Kubeflow component and provide an easy way for people to change their deployment according to their needs. ArgoCD application manifests for each component are used to deploy Kubeflow. The intended usage is for people to fork this repository, make their desired kustomizations, run a script to change the ArgoCD application specs to point to their fork of this repository, and finally apply a master ArgoCD application that will deploy all other applications.
Mandatory:
Optional (if using setup_credentials.sh to generate initial credentials as sealed secrets):
- yq
- python 3.6 or newer
- kubeseal
- Python libraries:
  - passlib
This repository uses a very simple initialisation script, ./setup_repo.sh, that takes a config file (such as the example one, ./examples/setup.conf) and iterates over all lines therein. A single line would, for example, look as follows:
<<__role_arn.cluster_autoscaler__>>=arn:aws:iam::123456789012:role/my-cluster_kube-system_aws-cluster-autoscaler
The init script will look for all occurrences in the ./distribution folder of the placeholder <<__role_arn.cluster_autoscaler__>> and will replace it with the value arn:aws:iam::123456789012:role/my-cluster_kube-system_aws-cluster-autoscaler. Please note that comments (//, #), quotation marks (", ') and unnecessary line breaks should be avoided.
You may add any additional placeholder/value pairs you want. The naming convention <<__...__>> has no functional purpose other than to aid readability and minimise the risk of a "find-and-replace" being performed on a value that was not meant as a placeholder.
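The behaviour described above can be approximated in a few lines of Python. This is a sketch of the idea only, not the actual ./setup_repo.sh (which is a shell script whose implementation may differ); the function names `parse_conf`, `apply_replacements` and `replace_in_tree` are hypothetical.

```python
from pathlib import Path


def parse_conf(text):
    """Parse 'placeholder=value' lines into a dict, skipping blank lines."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        placeholder, _, value = line.partition("=")
        pairs[placeholder] = value
    return pairs


def apply_replacements(text, pairs):
    """Replace every occurrence of each placeholder in text with its value."""
    for placeholder, value in pairs.items():
        text = text.replace(placeholder, value)
    return text


def replace_in_tree(root, pairs):
    """Rewrite every UTF-8 text file under root in place (illustrative only)."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            original = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        updated = apply_replacements(original, pairs)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
```

Because the replacement is purely textual, a quoted or commented value in the config file would be copied into the manifests verbatim, which is why such characters should be avoided.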
Finally, if you wish you can use the "setup_credentials.sh" script to generate SealedSecrets that will be used for access to "admin" applications, such as the ArgoCD dashboard (in the future), Grafana, Dex, Keycloak, the kubeflow admin user etc. This script will generate various random credentials and create a "sealed" representation that is safe to declare in your Git repository.
Run the following commands to install the kubeseal CLI on Linux:
wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.16.0/kubeseal-linux-amd64 -O kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
On macOS you can use Homebrew to install the kubeseal CLI:
brew install kubeseal
Next, ensure passlib is installed:
pip install passlib
Deploy the Sealed Secrets controller to the cluster:
kubectl apply -f distribution/argocd-applications/sealed-secrets.yaml
Finally, the script can be run with:
./setup_credentials.sh --email you@example.com --username youruser --firstname Yourname --lastname Yoursurname --password yourpassword
You may leave out any of the input parameters. In that case, a default value (or generated value in the case of passwords) will be used. Alternatively, environment variables can be used instead of input parameters.
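The fallback behaviour can be pictured with the following Python sketch. It is an assumption-laden illustration rather than the script's real logic: the helper `resolve` is hypothetical, the environment variable names are hypothetical, and the precedence shown (command-line value first, then environment variable, then default, then a generated value) is assumed, not confirmed by this document.

```python
import os
import secrets


def resolve(cli_value, env_name, default=None, environ=None):
    """Pick a setting: CLI value, then environment variable, then default,
    and finally a randomly generated value when no default is given."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    if environ.get(env_name):
        return environ[env_name]
    if default is not None:
        return default
    # No value anywhere: generate one, as is done for omitted passwords.
    return secrets.token_urlsafe(24)


if __name__ == "__main__":
    # Hypothetical variable names, used only for this example.
    username = resolve(None, "EXAMPLE_USERNAME", default="user")
    password = resolve(None, "EXAMPLE_PASSWORD")
    print(username, len(password))
```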
To initialise your repository, do the following:
- fork this repo
- modify the kustomizations for your purpose. You may in particular wish to edit distribution/kubeflow.yaml with the selection of applications you wish to roll out
- set up a "setup.conf" file (or do a manual "find-and-replace" if you prefer) such as this one in the root of the repository
- run ./setup_repo.sh setup.conf
- (optionally) run ./setup_credentials.sh --email you@example.com --username youruser --firstname Yourname --lastname Yoursurname --password yourpassword
- commit and push your changes
Start up external-secrets:
kustomize build distribution/external-secrets/ | kubectl apply -f -
Start up argocd:
- If you are using a public repo:
kustomize build distribution/argocd/base/ | kubectl apply -f -
- If you are using a private repo (note that this will use an ExternalSecret to fetch git credentials from AWS Secrets Manager):
kustomize build distribution/argocd/overlays/private-repo/ | kubectl apply -f -
Finally, roll out Kubeflow with:
kubectl apply -f distribution/kubeflow.yaml
If you wish, you may also set up ArgoCD to manage itself, as follows:
kubectl apply -f distribution/argocd-applications/argocd.yaml
To customize the list of images presented in the Jupyter Web App and other related settings, such as allowing custom images, edit the spawner_ui_config.yaml file.
A large problem for many people is how to easily upload or download data to and from the PVCs mounted as their workspace volumes for Notebook Servers. To make this easier a simple PVCViewer Controller was created (a slightly modified version of the tensorboard-controller). This feature was not ready in time for 1.3, and thus I am only documenting it here as an experimental feature as I believe many people would like to have this functionality. The images are grabbed from my personal dockerhub profile, but I can provide instructions for people that would like to build the images themselves. Also, it is important to note that the PVC Viewer will work with ReadWriteOnce PVCs, even when they are mounted to an active Notebook Server.
Here is an example of the PVC Viewer in action:
To use the PVCViewer Controller, it must be deployed along with an updated version of the Volumes Web App. To do so, deploy experimental-pvcviewer-controller.yaml and experimental-volumes-web-app.yaml instead of the regular Volumes Web App. If you are deploying Kubeflow with the kubeflow.yaml file, you can edit the root kustomization.yaml and comment out the regular Volumes Web App and uncomment the PVCViewer Controller and Experimental Volumes Web App.
By default, all the ArgoCD application specs included here are set up to automatically sync with the specified repoURL. If you would like to change something about your deployment, simply make the change, commit it and push it to your fork of this repo. ArgoCD will automatically detect the changes and update the necessary resources in your cluster.
By default the ArgoCD UI is rolled out behind a ClusterIP. This can be accessed for development purposes with port forwarding, for example:
kubectl port-forward svc/argocd-server -n argocd 8888:80
The UI will now be accessible at localhost:8888 and can be accessed with the initial admin password. The password is stored in a secret and can be read as follows:
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
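Kubernetes stores Secret values base64-encoded, which is why the command above pipes the field through base64 -d. For readers who prefer to script this step, here is a minimal Python sketch that performs the same decoding on the output of `kubectl get secret ... -o json`. The helper name `decode_secret_field` is hypothetical.

```python
import base64
import json


def decode_secret_field(secret_json, field="password"):
    """Decode one base64-encoded field from `kubectl get secret -o json` output."""
    secret = json.loads(secret_json)
    return base64.b64decode(secret["data"][field]).decode("utf-8")


if __name__ == "__main__":
    import subprocess

    # Assumes kubectl is installed and configured for the target cluster.
    raw = subprocess.run(
        ["kubectl", "-n", "argocd", "get", "secret",
         "argocd-initial-admin-secret", "-o", "json"],
        check=True, stdout=subprocess.PIPE,
    ).stdout.decode("utf-8")
    print(decode_secret_field(raw))
```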
If you wish to update the password, this can be done with the argocd CLI, using the following commands:
argocd login localhost:8888
argocd account update-password
Before contributing, please install pre-commit and initialise .pre-commit-config.yaml by running the following from the repo's root directory:
pre-commit install
Please feel free to add features by forking this repo, developing and testing your feature and merging back to master via a Pull Request. We are currently still a small community, but feel free to also report bugs or make feature requests on the issue board!