Skip to content

Latest commit

 

History

History
251 lines (166 loc) · 14 KB

README.adoc

File metadata and controls

251 lines (166 loc) · 14 KB

Red Hat OpenShift AI

Red Hat OpenShift AI (RHOAI) builds on the capabilities of Red Hat OpenShift to provide a single, consistent, enterprise-ready hybrid AI and MLOps platform. It provides tools across the full lifecycle of AI/ML experiments and models including training, serving, monitoring, and managing AI/ML models and AI-enabled applications. This is my personal repository to test and play with some of its most important features.

1. Red Hat Training

RHOAI is a product under continuous improvement, so this repo will be outdated at some point in time. I recommend you to refer to the Official documentation to check the latest features or you can try the official trainings.

Red Hat OpenShift AI (RHOAI) is a platform for data scientists, AI practitioners, developers, machine learning engineers, and operations teams to prototype, build, deploy, and monitor AI models. This is a wide variety of audience that needs different kinds of training. For that reason, there are several courses that will help you to understand RHOAI from all angles:

  • AI262 - Introduction to Red Hat OpenShift AI: About configuring Data Science Projects and Jupyter Notebooks.

  • AI263 - Red Hat OpenShift AI Administration: About installing RHOAI, configuring users and permissions and creating Custom Notebook Images.

  • AI264 - Creating Machine Learning Models with Red Hat OpenShift AI: About training models and enhancing the model training.

  • AI265 - Deploying Machine Learning Models with Red Hat OpenShift AI: About serving models on RHOAI.

  • AI266 - Automating AI/ML Workflows with Red Hat OpenShift AI: About creating Data Science Pipelines, and Elyra and Kubeflow Pipelines.

  • AI267 - Developing and Deploying AI/ML Applications on Red Hat OpenShift AI: All the previous courses altogether.

2. RHOAI Architecture

The following diagram depicts the general architecture of a RHOAI deployment, including the most important components:

RHOAI Architecture
Figure 1. RHOAI Architecture
  • codeflare: Codeflare is an IBM software stack for developing and scaling machine-learning and Python workloads. It uses and needs the Ray component.

  • dashboard: Provides the RHOAI dashboard.

  • datasciencepipelines: This enables you to build portable machine learning workflows. Requires the OpenShift Pipelines Operator to be present before enabling the data science pipelines.

  • kserve: RHOAI uses Kserve to serve large language models that can scale based on demand. Requires the OpenShift Serverless and the OpenShift Service Mesh operators to be present before enabling the component. Does not support enabled ModelMeshServing at the same time.

  • kueue: Kueue component configuration. It is not yet in Technology Preview

  • modelmeshserving: KServe also offers a component for general-purpose model serving, called ModelMesh Mesh Serving. Activate this component to serve small and medium size models. Does not support enabled Kserve at the same time.

  • ray: Component to run the data science code in a distributed manner.

  • workbenches: Workbenches are containerized and isolated working environments for data scientists to examine data and work with data models. Data scientists can create workbenches from an existing notebook container image to access its resources and properties. Workbenches are associated to container storage to prevent data loss when the workbench container is restarted or deleted.

3. Installation

Installing RHOAI is not as simple as installing and configuring other operators on OpenShift. This product provides integration with hardware like NVIDIA and Intel GPUs, automation of ML workflows and AI training, and deployment of LLMs. For that reason, I’ve created an auto-install.sh script that will do everything for you:

  1. If the installation is IPI AWS, it will create MachineSets for nodes with NVIDIA GPUs (Currently, g5.4xlarge).

  2. Install all the operators that RHOAI depends on:

    • Service Mesh and Serverless to enable KServe and allow Single-Model serving platform.

    • Node Feature Discovery and Nvidia GPU Operator to discover and configure nodes with GPU.

    • Authorino, to enable token authorization for models deployed with RHOAI.

  3. Install and configure OpenShift Data Foundation (ODF) in Multicloud Object Gateway (MCG) mode. This is a lightweight alternative that allows us to use the AWS S3 object storage the same way that we will then use Object storage on Baremetal using ODF.

  4. Installs the actual RHOAI operator and configures the installation with some defaults, enabling NVIDIA acceleration and Single-Model Serving.

  5. Deploys a new Data Science Project called RHOAI Playground enabling pipelines and deploying a basic Notebook for testing.

💡

💡 Tip 💡 The script contains many tasks divided in clear blocks with comments. Use the Environment Variables or add comments to disable those that you are not interested in.

In order to automate it all, it relays on OpenShift GitOps (ArgoCD), so you will to have it installed before executing the following script. Check out my automated installation on alvarolop/ocp-gitops-playground GitHub repository.

Now, log in to the cluster and just execute the script:

./auto-install.sh

4. Things you should know!

4.1. NVIDIA GPU nodes

Most of the activities related to RHOAI will require GPU Acceleration. For that purpose, we add NVIDIA GPU nodes during the installation process. In this chapter, I collect some information that might be useful for you.

In this automation, we are currently using the AWS g5.2xlarge instance, that according to the documentation:

Amazon EC2 G5 instances are designed to accelerate graphics-intensive applications and machine learning inference. They can also be used to train simple to moderately complex machine learning models.

How to know that a node has NVIDIA GPUs using NodeFeatureDiscovery?

The output of the following command will only be visible when you have applied the ArgoCD Application and the Node Feature Discovery operator has scanned the OpenShift nodes:

oc describe node | egrep 'Roles|pci'
Roles:              control-plane,master
Roles:              worker
                    feature.node.kubernetes.io/pci-1d0f.present=true
Roles:              gpu-worker,worker
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
Roles:              control-plane,master
Roles:              control-plane,master

pci-10de is the PCI vendor ID that is assigned to NVIDIA.

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM based monitoring and others.

After configuring the Node Feature Discovery Operator and the NVidia GPU Operator using GitOps, you need to confirm that the Nvidia operator is correctly retrieving the GPU information. You can use the following command to confirm that OpenShift is correctly configured:

oc exec -it -n nvidia-gpu-operator $(oc get pod -o wide -l openshift.driver-toolkit=true -o jsonpath="{.items[0].metadata.name}" -n nvidia-gpu-operator) -- nvidia-smi

The output should look like this:

Sat Oct 26 08:47:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8             22W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If, for some race condition, RHOAI is not detecting that GPU worker, you might need to force it to recalculate. You can do so easily with the following command:

oc delete cm migration-gpu-status -n redhat-ods-applications; sleep 3; oc delete pods -l app=rhods-dashboard -n redhat-ods-applications

Wait for a few seconds until the dashboard pods start again and you will see in the RHOAI web console that now the NVidia GPU Accelerator Profile is listed.

4.2. Data Connection Pipelines S3 Bucket Secret

The DataSciencePipelineApplication requires an S3-compatible storage solution to store artifacts that are generated in the pipeline. You can use any S3-compatible storage solution for data science pipelines, including AWS S3, OpenShift Data Foundation, or MinIO. The automation is currently using ODF with Nooba to interact with the AWS S3 interface, so you won’t need to do anything. Nevertheless, if you decide to disable ODF, you will need to create buckets on AWS S3 manually and for that you will need the following process:

  1. Define the configuration variables for AWS is a file dubbed aws-env-vars. You can use the same structure as in aws-env-vars.example

  2. Execute the following command to interact with the AWS API:

    ./prerequisites/s3-bucket/create-aws-s3-bucket.sh

4.3. Reusing Router Certificates

By default, the Single Stack Serving in Openshift AI uses a self-signed certificate generated at installation for the endpoints that are created when deploying a server. This can be counter-intuitive because if you already have certificates configured on your OpenShift cluster, they will be used by default for other types of endpoints like Routes.

This following procedure explains how to use the same certificate that you already have for your OpenShift cluster.

export INGRESS_SECRET_NAME=$(oc get ingresscontroller default -n openshift-ingress-operator -o json | jq -r .spec.defaultCertificate.name)
oc get secret ${INGRESS_SECRET_NAME} -n openshift-ingress -o yaml | yq 'del(.metadata["namespace","creationTimestamp","resourceVersion","uid"])' | yq '.metadata.name = "rhods-internal-primary-cert-bundle-secret"' > rhods-internal-primary-cert-bundle-secret.yaml
oc apply -n istio-system -f rhods-internal-primary-cert-bundle-secret.yaml