Update and fix documentation for website doc
ratnopamc committed Sep 20, 2024
1 parent 4136b09 commit 4bdef0a
Showing 13 changed files with 45 additions and 52 deletions.
2 changes: 1 addition & 1 deletion ai-ml/jark-stack/terraform/addons.tf
@@ -167,7 +167,7 @@ module "eks_blueprints_addons" {
}
],
}

#---------------------------------------
# CloudWatch metrics for EKS
#---------------------------------------
@@ -8,4 +8,4 @@ resources:

# This toleration allows the DaemonSet pod to be scheduled on any node, regardless of its taints.
tolerations:
- operator: Exists
- operator: Exists
6 changes: 3 additions & 3 deletions website/docs/gen-ai/inference/GPUs/nvidia-nim-llama3.md
@@ -245,7 +245,7 @@ you will see similar output like the following
It's time to test the Llama3 model we just deployed. First, set up a simple environment for testing.

```bash
cd gen-ai/inference/nvidia-nim/nim-client
cd data-on-eks/gen-ai/inference/nvidia-nim/nim-client
python3 -m venv .venv
source .venv/bin/activate
pip install openai
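# Optional sanity check before running the client: with the NIM service
# port-forwarded to localhost:8000 (an assumption; adjust host/port to your setup),
# list the models exposed by the OpenAI-compatible API
curl -s http://localhost:8000/v1/models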
@@ -335,7 +335,7 @@ By applying these optimizations, TensorRT can significantly accelerate LLM infer
Deploy the [Open WebUI](https://github.com/open-webui/open-webui) by running the following command:

```sh
kubectl apply -f gen-ai/inference/nvidia-nim/openai-webui-deployment.yaml
kubectl apply -f data-on-eks/gen-ai/inference/nvidia-nim/openai-webui-deployment.yaml
```
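
Before moving on, it can help to confirm that the WebUI pods came up cleanly. This is a minimal, optional check; the grep pattern is an assumption, so match it to the names used in `openai-webui-deployment.yaml`:

```sh
kubectl get pods -A | grep -i webui
```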

**2. Port Forward to Access WebUI**
@@ -373,7 +373,7 @@ Enter your prompt, and you will see the streaming results, as shown below:
GenAI-Perf can be used as a standard tool to benchmark other models deployed with an inference server. However, this tool requires a GPU. To make it easier, we provide a pre-configured manifest, `genaiperf-deploy.yaml`, to run the tool.

```bash
cd gen-ai/inference/nvidia-nim
cd data-on-eks/gen-ai/inference/nvidia-nim
kubectl apply -f genaiperf-deploy.yaml
```
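
Once the manifest is applied, you can wait for the benchmark pod to become ready and open a shell in it to run GenAI-Perf. The label selector below is an assumption; replace it with the labels actually set in `genaiperf-deploy.yaml`:

```bash
# Wait for the GenAI-Perf pod to be scheduled on a GPU node and become ready
kubectl wait --for=condition=Ready pod -l app=genaiperf --timeout=600s

# Open an interactive shell in the pod to run the benchmark
kubectl exec -it "$(kubectl get pod -l app=genaiperf -o name | head -n1)" -- /bin/bash
```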

8 changes: 4 additions & 4 deletions website/docs/gen-ai/inference/GPUs/stablediffusion-gpus.md
@@ -121,7 +121,7 @@ aws eks --region us-west-2 update-kubeconfig --name jark-stack
**Deploy RayServe Cluster**

```bash
cd ./../gen-ai/inference/stable-diffusion-rayserve-gpu
cd data-on-eks/gen-ai/inference/stable-diffusion-rayserve-gpu
kubectl apply -f ray-service-stablediffusion.yaml
```
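
The Ray head and worker pods can take several minutes to pull images and load the model onto the GPU. A quick, optional way to follow progress without assuming a namespace is to watch the KubeRay-managed pods across the cluster:

```bash
# List RayService objects and watch the Ray head/worker pods come up
kubectl get rayservices -A
kubectl get pods -A -l ray.io/node-type -w
```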

@@ -198,7 +198,7 @@ Let's move forward with setting up the Gradio app as a Docker container running
First, let's build the Docker container for the client app.

```bash
cd ../gradio-ui
cd data-on-eks/gen-ai/inference/gradio-ui
docker build --platform=linux/amd64 \
-t gradio-app:sd \
--build-arg GRADIO_APP="gradio-app-stable-diffusion.py" \
@@ -263,14 +263,14 @@ docker rmi gradio-app:sd
**Step2:** Delete Ray Cluster

```bash
cd ../stable-diffusion-rayserve-gpu
cd data-on-eks/gen-ai/inference/stable-diffusion-rayserve-gpu
kubectl delete -f ray-service-stablediffusion.yaml
```

**Step3:** Cleanup the EKS Cluster
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/jark-stack/
cd data-on-eks/ai-ml/jark-stack/
./cleanup.sh
```
@@ -333,7 +333,7 @@ kubectl -n triton-vllm port-forward svc/nvidia-triton-server-triton-inference-se
Next, run the Triton client for each model using the same prompts:

```bash
cd gen-ai/inference/vllm-nvidia-triton-server-gpu/triton-client
cd data-on-eks/gen-ai/inference/vllm-nvidia-triton-server-gpu/triton-client
python3 -m venv .venv
source .venv/bin/activate
pip install tritonclient[all]
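# Optional: verify the Triton server is reachable through the port-forward above
# (assumes the HTTP port is forwarded to localhost:8000; adjust to your setup)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready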
2 changes: 1 addition & 1 deletion website/docs/gen-ai/inference/GPUs/vLLM-rayserve.md
@@ -238,7 +238,7 @@ You can test with your custom prompts by adding them to the `prompts.txt` file.
To run the Python client application in a virtual environment, follow these steps:

```bash
cd gen-ai/inference/vllm-rayserve-gpu
cd data-on-eks/gen-ai/inference/vllm-rayserve-gpu
python3 -m venv .venv
source .venv/bin/activate
pip install requests
10 changes: 5 additions & 5 deletions website/docs/gen-ai/inference/Neuron/Mistral-7b-inf2.md
@@ -15,7 +15,7 @@ To generate a token in HuggingFace, log in using your HuggingFace account and cl

:::

# Deploying Mistral-7B-Instruct-v0.2 on Inferentia2, Ray Serve, Gradio
# Serving Mistral-7B-Instruct-v0.2 using Inferentia2, Ray Serve, Gradio
This pattern outlines the deployment of the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model on Amazon EKS, utilizing [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for enhanced text generation performance. [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) ensures efficient scaling of Ray Worker nodes, while [Karpenter](https://karpenter.sh/) dynamically manages the provisioning of AWS Inferentia2 nodes. This setup optimizes for high-performance and cost-effective text generation applications in a scalable cloud environment.

Through this pattern, you will accomplish the following:
@@ -121,7 +121,7 @@ To deploy the Mistral-7B-Instruct-v0.2 model, it's essential to configure your H

export HUGGING_FACE_HUB_TOKEN=$(echo -n "Your-Hugging-Face-Hub-Token-Value" | base64)

cd ../../gen-ai/inference/mistral-7b-rayserve-inf2
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2
envsubst < ray-service-mistral.yaml| kubectl apply -f -
```
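
Karpenter now has to launch an Inferentia2 node before the Ray pods can be scheduled, which typically takes a few minutes. An optional way to watch the new capacity arrive:

```bash
# Watch nodes join the cluster, showing their instance type (expect inf2.*)
kubectl get nodes -L node.kubernetes.io/instance-type -w
```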

@@ -190,7 +190,7 @@ The following YAML script (`gen-ai/inference/mistral-7b-rayserve-inf2/gradio-ui.
To deploy this, execute:

```bash
cd gen-ai/inference/mistral-7b-rayserve-inf2/
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2/
kubectl apply -f gradio-ui.yaml
```
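
When the Gradio pod is running, you can reach the UI locally with a port-forward. Gradio listens on port 7860 by default; the service name below is an assumption, so substitute the one defined in `gradio-ui.yaml`:

```bash
# Service name (and namespace, if any) are assumptions; use the values from gradio-ui.yaml
kubectl port-forward service/gradio-service 7860:7860
# Then open http://localhost:7860 in your browser
```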

@@ -242,7 +242,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou
**Step1:** Delete Gradio App and mistral Inference deployment

```bash
cd gen-ai/inference/mistral-7b-rayserve-inf2
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2
kubectl delete -f gradio-ui.yaml
kubectl delete -f ray-service-mistral.yaml
```
@@ -251,6 +251,6 @@ kubectl delete -f ray-service-mistral.yaml
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
12 changes: 6 additions & 6 deletions website/docs/gen-ai/inference/Neuron/llama2-inf2.md
@@ -1,7 +1,7 @@
---
title: Llama-2 on Inferentia2
sidebar_position: 4
description: Deploy Llama-2 models on AWS Inferentia accelerators for efficient inference.
description: Serve Llama-2 models on AWS Inferentia accelerators for efficient inference.
---
import CollapsibleContent from '../../../../src/components/CollapsibleContent';

@@ -23,7 +23,7 @@ We are actively enhancing this blueprint to incorporate improvements in observab
:::


# Deploying Llama-2-13b Chat Model with Inferentia, Ray Serve and Gradio
# Serving Llama-2-13b Chat Model with Inferentia, Ray Serve and Gradio
Welcome to the comprehensive guide on deploying the [Meta Llama-2-13b chat](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).
In this tutorial, you will not only learn how to harness the power of Llama-2, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [trn1/inf2](https://aws.amazon.com/machine-learning/neuron/) (powered by AWS Trainium and Inferentia) instances, such as `inf2.24xlarge` and `inf2.48xlarge`,
which are optimized for deploying and scaling large language models.
@@ -158,7 +158,7 @@ aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
**Deploy RayServe Cluster**

```bash
cd gen-ai/inference/llama2-13b-chat-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2
kubectl apply -f ray-service-llama2.yaml
```
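
While the Llama-2-13b model is compiled and loaded onto the Inferentia accelerators (this can take a while), the Ray dashboard is a convenient way to watch the Serve deployment come up. Ray exposes the dashboard on port 8265; the namespace and service name below are assumptions, so substitute the ones created by `ray-service-llama2.yaml`:

```bash
kubectl -n llama2 port-forward svc/llama2-service 8265:8265
# Then open http://localhost:8265 and check the Serve tab for replica status
```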

@@ -282,7 +282,7 @@ The following YAML script (`gen-ai/inference/llama2-13b-chat-rayserve-inf2/gradi
To deploy this, execute:

```bash
cd gen-ai/inference/llama2-13b-chat-rayserve-inf2/
cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2/
kubectl apply -f gradio-ui.yaml
```

@@ -330,7 +330,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou
**Step1:** Delete Gradio App and Llama2 Inference deployment

```bash
cd gen-ai/inference/llama2-13b-chat-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2
kubectl delete -f gradio-ui.yaml
kubectl delete -f ray-service-llama2.yaml
```
@@ -339,6 +339,6 @@ kubectl delete -f ray-service-llama2.yaml
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ai-ml/trainium-inferentia
cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh
```
12 changes: 6 additions & 6 deletions website/docs/gen-ai/inference/Neuron/llama3-inf2.md
@@ -1,7 +1,7 @@
---
title: Llama-3-8B on Inferentia2
sidebar_position: 3
description: Deploy Llama-3 models on AWS Inferentia accelerators for efficient inference.
description: Serve Llama-3 models on AWS Inferentia accelerators for efficient inference.
---
import CollapsibleContent from '../../../../src/components/CollapsibleContent';

@@ -23,7 +23,7 @@ We are actively enhancing this blueprint to incorporate improvements in observab
:::


# Deploying Llama-3-8B Instruct Model with Inferentia, Ray Serve and Gradio
# Serving Llama-3-8B Instruct Model with Inferentia, Ray Serve and Gradio

Welcome to the comprehensive guide on deploying the [Meta Llama-3-8B Instruct](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

@@ -158,7 +158,7 @@ To deploy the llama3-8B-Instruct model, it's essential to configure your Hugging

export HUGGING_FACE_HUB_TOKEN=<Your-Hugging-Face-Hub-Token-Value>

cd ./../gen-ai/inference/llama3-8b-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama3-8b-rayserve-inf2
envsubst < ray-service-llama3.yaml| kubectl apply -f -
```
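
Once Karpenter provisions an inf2 node, you can optionally confirm that the Neuron device plugin is advertising accelerators before the Ray workers are scheduled:

```bash
# Shows the Neuron devices advertised on each node (empty until the device plugin is ready)
kubectl describe nodes | grep -i "aws.amazon.com/neuron"
```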

@@ -244,7 +244,7 @@ Let's move forward with setting up the Gradio app as a Docker container running
First, let's build the Docker container for the client app.

```bash
cd ../gradio-ui
cd data-on-eks/gen-ai/inference/gradio-ui
docker build --platform=linux/amd64 \
-t gradio-app:llama \
--build-arg GRADIO_APP="gradio-app-llama.py" \
@@ -298,14 +298,14 @@ docker rmi gradio-app:llama
**Step2:** Delete Ray Cluster

```bash
cd ../llama3-8b-instruct-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama3-8b-instruct-rayserve-inf2
kubectl delete -f ray-service-llama3.yaml
```

**Step3:** Cleanup the EKS Cluster
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
8 changes: 4 additions & 4 deletions website/docs/gen-ai/inference/Neuron/rayserve-ha.md
@@ -66,7 +66,7 @@ export TF_VAR_enable_rayserve_ha_elastic_cache_redis=true
Then, run the `install.sh` script to install the EKS cluster with KubeRay operator and other add-ons.

```bash
cd ai-ml/trainimum-inferentia
cd data-on-eks/ai-ml/trainimum-inferentia
./install.sh
```
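
After the script finishes, it is worth confirming that the KubeRay operator is running before deploying the RayService. The cluster name matches the other Neuron blueprints in this repository, and the operator namespace is an assumption based on common add-on defaults; adjust both if your installation differs:

```bash
aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
kubectl get pods -n kuberay-operator
```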

@@ -135,7 +135,7 @@ With the above `RayService` configuration, we have enabled GCS fault tolerance f
Let's apply the above `RayService` configuration and check the behavior.

```bash
cd ../../gen-ai/inference/
cd data-on-eks/gen-ai/inference/
envsubst < mistral-7b-rayserve-inf2/ray-service-mistral-ft.yaml| kubectl apply -f -
```
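
With GCS fault tolerance backed by ElastiCache Redis, the Serve application should survive a restart of the Ray head pod. A minimal sketch of how to exercise this, assuming the RayService runs in a `mistral` namespace (an assumption; use the namespace from the manifest):

```bash
# Delete the Ray head pod and watch the RayService recover without redeploying
kubectl get pods -n mistral -l ray.io/node-type=head
kubectl delete pod -n mistral -l ray.io/node-type=head
kubectl get rayservice -n mistral -w
```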

@@ -202,7 +202,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou
**Step1:** Delete Gradio App and mistral Inference deployment

```bash
cd gen-ai/inference/mistral-7b-rayserve-inf2
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2
kubectl delete -f gradio-ui.yaml
kubectl delete -f ray-service-mistral-ft.yaml
```
@@ -211,6 +211,6 @@ kubectl delete -f ray-service-mistral-ft.yaml
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
10 changes: 5 additions & 5 deletions website/docs/gen-ai/inference/Neuron/stablediffusion-inf2.md
@@ -14,7 +14,7 @@ This example blueprint deploys a `stable-diffusion-xl-base-1-0` model on Inferen

:::

# Deploying Stable Diffusion XL Base Model with Inferentia, Ray Serve and Gradio
# Serving Stable Diffusion XL Base Model with Inferentia, Ray Serve and Gradio
Welcome to the comprehensive guide on deploying the [Stable Diffusion XL Base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).
In this tutorial, you will not only learn how to harness the power of Stable Diffusion models, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [trn1/inf2](https://aws.amazon.com/machine-learning/neuron/) (powered by AWS Trainium and Inferentia) instances, such as `inf2.24xlarge` and `inf2.48xlarge`,
which are optimized for deploying and scaling large language models.
@@ -135,7 +135,7 @@ aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
**Deploy RayServe Cluster**

```bash
cd ../../gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2
cd data-on-eks/gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2
kubectl apply -f ray-service-stablediffusion.yaml
```

@@ -217,7 +217,7 @@ Let's move forward with setting up the Gradio app as a Docker container running
First, let's build the Docker container for the client app.

```bash
cd ../gradio-ui
cd data-on-eks/gen-ai/inference/gradio-ui
docker build --platform=linux/amd64 \
-t gradio-app:sd \
--build-arg GRADIO_APP="gradio-app-stable-diffusion.py" \
@@ -276,14 +276,14 @@ docker rmi gradio-app:sd
**Step2:** Delete Ray Cluster

```bash
cd ../stable-diffusion-xl-base-rayserve-inf2
cd data-on-eks/gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2
kubectl delete -f ray-service-stablediffusion.yaml
```

**Step3:** Cleanup the EKS Cluster
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
13 changes: 3 additions & 10 deletions website/docs/gen-ai/inference/Neuron/vllm-ray-inf2.md
@@ -163,7 +163,7 @@ Having deployed the EKS cluster with all the necessary components, we can now pr
This will apply the RayService configuration and deploy the cluster on your EKS setup.

```bash
cd ../../gen-ai/inference/vllm-rayserve-inf2
cd data-on-eks/gen-ai/inference/vllm-rayserve-inf2

kubectl apply -f vllm-rayserve-deployment.yaml
```
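
Model compilation on Inferentia can take several minutes, so it helps to watch the Ray head and worker pods in the `vllm` namespace (the namespace used by the port-forward later in this guide) until they are ready:

```bash
kubectl get pods -n vllm -w
```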
@@ -258,7 +258,7 @@ kubectl -n vllm port-forward svc/vllm-llama3-inf2-serve-svc 8000:8000
To run the Python client application in a virtual environment, follow these steps:

```bash
cd gen-ai/inference/vllm-rayserve-inf2
cd data-on-eks/gen-ai/inference/vllm-rayserve-inf2
python3 -m venv .venv
source .venv/bin/activate
pip3 install openai
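# Optional: with the port-forward from the previous step active (localhost:8000),
# check whether the OpenAI-compatible endpoint responds (available routes may vary)
curl -s http://localhost:8000/v1/models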
@@ -588,13 +588,6 @@ Each of these files contain the following Performance Benchmarking Metrics:
```results_number_output_tokens_*```: Number of output tokens in the requests (Output length)
## Cleanup
To remove all resources created by this deployment, run:
```bash
./cleanup.sh
```
## Conclusion
In summary, when it comes to deploying and scaling Llama-3, AWS Trn1/Inf2 instances offer a compelling advantage.
@@ -615,7 +608,7 @@ kubectl delete -f vllm-rayserve-deployment.yaml
Destroy the EKS Cluster and resources
```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
10 changes: 5 additions & 5 deletions website/docs/resources/binpacking-custom-scheduler-eks.md
@@ -10,9 +10,9 @@ sidebar_label: Bin packing for Amazon EKS
In this post, we will show you how to enable a custom scheduler with Amazon EKS when running DoEKS, especially for Spark on EKS, including OSS Spark and EMR on EKS. The custom scheduler is a custom Kubernetes scheduler with the ```MostAllocated``` strategy running in the data plane.

### Why bin packing
By default, the [scheduling-plugin](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins) NodeResourcesFit uses ```LeastAllocated``` as its score strategy. For long-running workloads, that is good because it favors high availability. But for batch jobs, like Spark workloads, this can lead to higher cost. Changing the strategy from ```LeastAllocated``` to ```MostAllocated``` avoids spreading pods across all running nodes, leading to higher resource utilization and better cost efficiency.
By default, the [scheduling-plugin](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins) NodeResourcesFit uses ```LeastAllocated``` as its score strategy. For long-running workloads, that is good because it favors high availability. But for batch jobs, like Spark workloads, this can lead to higher cost. Changing the strategy from ```LeastAllocated``` to ```MostAllocated``` avoids spreading pods across all running nodes, leading to higher resource utilization and better cost efficiency.

Batch jobs like Spark run on demand for a limited or predictable amount of time. With the ```MostAllocated``` strategy, Spark executors are bin packed onto one node until that node cannot host any more pods. The following picture shows the
Batch jobs like Spark run on demand for a limited or predictable amount of time. With the ```MostAllocated``` strategy, Spark executors are bin packed onto one node until that node cannot host any more pods. The following picture shows the

```MostAllocated``` strategy in EMR on EKS.
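
For reference, the bin-packing behaviour described above comes down to the `NodeResourcesFit` scoring strategy in the scheduler configuration. The snippet below is a minimal, assumed sketch of such a configuration (the scheduler name and resource weights are placeholders, not the exact values shipped by this blueprint):

```bash
# Write a minimal KubeSchedulerConfiguration that scores nodes with MostAllocated
cat > bin-packing-scheduler-config.yaml <<'EOF'
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
EOF
```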

@@ -71,12 +71,12 @@ spec:
volumes:
- name: spark-local-dir-1
hostPath:
path: /local1
initContainers:
path: /local1
initContainers:
- name: volume-permission
image: public.ecr.aws/docker/library/busybox
# grant volume access to hadoop user
command: ['sh', '-c', 'if [ ! -d /data1 ]; then mkdir /data1;fi; chown -R 999:1000 /data1']
command: ['sh', '-c', 'if [ ! -d /data1 ]; then mkdir /data1;fi; chown -R 999:1000 /data1']
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1