Update and fix documentation for website doc
ratnopamc committed Sep 20, 2024
1 parent 4136b09 commit 4bdef0a
Showing 13 changed files with 45 additions and 52 deletions.
2 changes: 1 addition & 1 deletion ai-ml/jark-stack/terraform/addons.tf
@@ -167,7 +167,7 @@ module "eks_blueprints_addons" {
}
],
}

#---------------------------------------
# CloudWatch metrics for EKS
#---------------------------------------
@@ -8,4 +8,4 @@ resources:

# This toleration allows the DaemonSet pod to be scheduled on any node, regardless of its taints.
tolerations:
- operator: Exists
- operator: Exists
6 changes: 3 additions & 3 deletions website/docs/gen-ai/inference/GPUs/nvidia-nim-llama3.md
@@ -245,7 +245,7 @@ you will see similar output like the following
It's time to test the Llama3 model we just deployed. First, set up a simple environment for testing.

```bash
cd gen-ai/inference/nvidia-nim/nim-client
cd data-on-eks/gen-ai/inference/nvidia-nim/nim-client
python3 -m venv .venv
source .venv/bin/activate
pip install openai
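# Optional sanity check before running the client: with the NIM service
# port-forwarded to localhost:8000 (an assumption; adjust host/port to your setup),
# list the models exposed by the OpenAI-compatible API
curl -s http://localhost:8000/v1/models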
@@ -335,7 +335,7 @@ By applying these optimizations, TensorRT can significantly accelerate LLM infer
Deploy the [Open WebUI](https://github.com/open-webui/open-webui) by running the following command:

```sh
kubectl apply -f gen-ai/inference/nvidia-nim/openai-webui-deployment.yaml
kubectl apply -f data-on-eks/gen-ai/inference/nvidia-nim/openai-webui-deployment.yaml
```
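
Before moving on, it can help to confirm that the WebUI pods came up cleanly. This is a minimal, optional check; the grep pattern is an assumption, so match it to the names used in `openai-webui-deployment.yaml`:

```sh
kubectl get pods -A | grep -i webui
```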

**2. Port Forward to Access WebUI**
@@ -373,7 +373,7 @@ Enter your prompt, and you will see the streaming results, as shown below:
GenAI-Perf can be used as a standard tool to benchmark other models deployed with an inference server. However, this tool requires a GPU. To make it easier, we provide a pre-configured manifest, `genaiperf-deploy.yaml`, to run the tool.

```bash
cd gen-ai/inference/nvidia-nim
cd data-on-eks/gen-ai/inference/nvidia-nim
kubectl apply -f genaiperf-deploy.yaml
```
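
Once the manifest is applied, you can wait for the benchmark pod to become ready and open a shell in it to run GenAI-Perf. The label selector below is an assumption; replace it with the labels actually set in `genaiperf-deploy.yaml`:

```bash
# Wait for the GenAI-Perf pod to be scheduled on a GPU node and become ready
kubectl wait --for=condition=Ready pod -l app=genaiperf --timeout=600s

# Open an interactive shell in the pod to run the benchmark
kubectl exec -it "$(kubectl get pod -l app=genaiperf -o name | head -n1)" -- /bin/bash
```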

8 changes: 4 additions & 4 deletions website/docs/gen-ai/inference/GPUs/stablediffusion-gpus.md
@@ -121,7 +121,7 @@ aws eks --region us-west-2 update-kubeconfig --name jark-stack
**Deploy RayServe Cluster**

```bash
cd ./../gen-ai/inference/stable-diffusion-rayserve-gpu
cd data-on-eks/gen-ai/inference/stable-diffusion-rayserve-gpu
kubectl apply -f ray-service-stablediffusion.yaml
```
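
The Ray head and worker pods can take several minutes to pull images and load the model onto the GPU. A quick, optional way to follow progress without assuming a namespace is to watch the KubeRay-managed pods across the cluster:

```bash
# List RayService objects and watch the Ray head/worker pods come up
kubectl get rayservices -A
kubectl get pods -A -l ray.io/node-type -w
```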

@@ -198,7 +198,7 @@ Let's move forward with setting up the Gradio app as a Docker container running
First, let's build the Docker container for the client app.

```bash
cd ../gradio-ui
cd data-on-eks/gen-ai/inference/gradio-ui
docker build --platform=linux/amd64 \
-t gradio-app:sd \
--build-arg GRADIO_APP="gradio-app-stable-diffusion.py" \
@@ -263,14 +263,14 @@ docker rmi gradio-app:sd
**Step2:** Delete Ray Cluster

```bash
cd ../stable-diffusion-rayserve-gpu
cd data-on-eks/gen-ai/inference/stable-diffusion-rayserve-gpu
kubectl delete -f ray-service-stablediffusion.yaml
```

**Step3:** Cleanup the EKS Cluster
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/jark-stack/
cd data-on-eks/ai-ml/jark-stack/
./cleanup.sh
```
@@ -333,7 +333,7 @@ kubectl -n triton-vllm port-forward svc/nvidia-triton-server-triton-inference-se
Next, run the Triton client for each model using the same prompts:

```bash
cd gen-ai/inference/vllm-nvidia-triton-server-gpu/triton-client
cd data-on-eks/gen-ai/inference/vllm-nvidia-triton-server-gpu/triton-client
python3 -m venv .venv
source .venv/bin/activate
pip install tritonclient[all]
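# Optional: verify the Triton server is reachable through the port-forward above
# (assumes the HTTP port is forwarded to localhost:8000; adjust to your setup)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready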
2 changes: 1 addition & 1 deletion website/docs/gen-ai/inference/GPUs/vLLM-rayserve.md
@@ -238,7 +238,7 @@ You can test with your custom prompts by adding them to the `prompts.txt` file.
To run the Python client application in a virtual environment, follow these steps:

```bash
cd gen-ai/inference/vllm-rayserve-gpu
cd data-on-eks/gen-ai/inference/vllm-rayserve-gpu
python3 -m venv .venv
source .venv/bin/activate
pip install requests
10 changes: 5 additions & 5 deletions website/docs/gen-ai/inference/Neuron/Mistral-7b-inf2.md
@@ -15,7 +15,7 @@ To generate a token in HuggingFace, log in using your HuggingFace account and cl

:::

# Deploying Mistral-7B-Instruct-v0.2 on Inferentia2, Ray Serve, Gradio
# Serving Mistral-7B-Instruct-v0.2 using Inferentia2, Ray Serve, Gradio
This pattern outlines the deployment of the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model on Amazon EKS, utilizing [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for enhanced text generation performance. [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) ensures efficient scaling of Ray Worker nodes, while [Karpenter](https://karpenter.sh/) dynamically manages the provisioning of AWS Inferentia2 nodes. This setup optimizes for high-performance and cost-effective text generation applications in a scalable cloud environment.

Through this pattern, you will accomplish the following:
@@ -121,7 +121,7 @@ To deploy the Mistral-7B-Instruct-v0.2 model, it's essential to configure your H

export HUGGING_FACE_HUB_TOKEN=$(echo -n "Your-Hugging-Face-Hub-Token-Value" | base64)

cd ../../gen-ai/inference/mistral-7b-rayserve-inf2
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2
envsubst < ray-service-mistral.yaml| kubectl apply -f -
```
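
Karpenter now has to launch an Inferentia2 node before the Ray pods can be scheduled, which typically takes a few minutes. An optional way to watch the new capacity arrive:

```bash
# Watch nodes join the cluster, showing their instance type (expect inf2.*)
kubectl get nodes -L node.kubernetes.io/instance-type -w
```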

@@ -190,7 +190,7 @@ The following YAML script (`gen-ai/inference/mistral-7b-rayserve-inf2/gradio-ui.
To deploy this, execute:

```bash
cd gen-ai/inference/mistral-7b-rayserve-inf2/
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2/
kubectl apply -f gradio-ui.yaml
```
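
When the Gradio pod is running, you can reach the UI locally with a port-forward. Gradio listens on port 7860 by default; the service name below is an assumption, so substitute the one defined in `gradio-ui.yaml`:

```bash
# Service name (and namespace, if any) are assumptions; use the values from gradio-ui.yaml
kubectl port-forward service/gradio-service 7860:7860
# Then open http://localhost:7860 in your browser
```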

@@ -242,7 +242,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou
**Step1:** Delete Gradio App and mistral Inference deployment

```bash
cd gen-ai/inference/mistral-7b-rayserve-inf2
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2
kubectl delete -f gradio-ui.yaml
kubectl delete -f ray-service-mistral.yaml
```
@@ -251,6 +251,6 @@ kubectl delete -f ray-service-mistral.yaml
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
12 changes: 6 additions & 6 deletions website/docs/gen-ai/inference/Neuron/llama2-inf2.md
@@ -1,7 +1,7 @@
---
title: Llama-2 on Inferentia2
sidebar_position: 4
description: Deploy Llama-2 models on AWS Inferentia accelerators for efficient inference.
description: Serve Llama-2 models on AWS Inferentia accelerators for efficient inference.
---
import CollapsibleContent from '../../../../src/components/CollapsibleContent';

@@ -23,7 +23,7 @@ We are actively enhancing this blueprint to incorporate improvements in observab
:::


# Deploying Llama-2-13b Chat Model with Inferentia, Ray Serve and Gradio
# Serving Llama-2-13b Chat Model with Inferentia, Ray Serve and Gradio
Welcome to the comprehensive guide on deploying the [Meta Llama-2-13b chat](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).
In this tutorial, you will not only learn how to harness the power of Llama-2, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [trn1/inf2](https://aws.amazon.com/machine-learning/neuron/) (powered by AWS Trainium and Inferentia) instances, such as `inf2.24xlarge` and `inf2.48xlarge`,
which are optimized for deploying and scaling large language models.
@@ -158,7 +158,7 @@ aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
**Deploy RayServe Cluster**

```bash
cd gen-ai/inference/llama2-13b-chat-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2
kubectl apply -f ray-service-llama2.yaml
```
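
While the Llama-2-13b model is compiled and loaded onto the Inferentia accelerators (this can take a while), the Ray dashboard is a convenient way to watch the Serve deployment come up. Ray exposes the dashboard on port 8265; the namespace and service name below are assumptions, so substitute the ones created by `ray-service-llama2.yaml`:

```bash
kubectl -n llama2 port-forward svc/llama2-service 8265:8265
# Then open http://localhost:8265 and check the Serve tab for replica status
```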

@@ -282,7 +282,7 @@ The following YAML script (`gen-ai/inference/llama2-13b-chat-rayserve-inf2/gradi
To deploy this, execute:

```bash
cd gen-ai/inference/llama2-13b-chat-rayserve-inf2/
cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2/
kubectl apply -f gradio-ui.yaml
```

@@ -330,7 +330,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou
**Step1:** Delete Gradio App and Llama2 Inference deployment

```bash
cd gen-ai/inference/llama2-13b-chat-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2
kubectl delete -f gradio-ui.yaml
kubectl delete -f ray-service-llama2.yaml
```
@@ -339,6 +339,6 @@ kubectl delete -f ray-service-llama2.yaml
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ai-ml/trainium-inferentia
cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh
```
12 changes: 6 additions & 6 deletions website/docs/gen-ai/inference/Neuron/llama3-inf2.md
@@ -1,7 +1,7 @@
---
title: Llama-3-8B on Inferentia2
sidebar_position: 3
description: Deploy Llama-3 models on AWS Inferentia accelerators for efficient inference.
description: Serve Llama-3 models on AWS Inferentia accelerators for efficient inference.
---
import CollapsibleContent from '../../../../src/components/CollapsibleContent';

@@ -23,7 +23,7 @@ We are actively enhancing this blueprint to incorporate improvements in observab
:::


# Deploying Llama-3-8B Instruct Model with Inferentia, Ray Serve and Gradio
# Serving Llama-3-8B Instruct Model with Inferentia, Ray Serve and Gradio

Welcome to the comprehensive guide on deploying the [Meta Llama-3-8B Instruct](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

@@ -158,7 +158,7 @@ To deploy the llama3-8B-Instruct model, it's essential to configure your Hugging

export HUGGING_FACE_HUB_TOKEN=<Your-Hugging-Face-Hub-Token-Value>

cd ./../gen-ai/inference/llama3-8b-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama3-8b-rayserve-inf2
envsubst < ray-service-llama3.yaml| kubectl apply -f -
```
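
Once Karpenter provisions an inf2 node, you can optionally confirm that the Neuron device plugin is advertising accelerators before the Ray workers are scheduled:

```bash
# Shows the Neuron devices advertised on each node (empty until the device plugin is ready)
kubectl describe nodes | grep -i "aws.amazon.com/neuron"
```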

@@ -244,7 +244,7 @@ Let's move forward with setting up the Gradio app as a Docker container running
First, let's build the Docker container for the client app.

```bash
cd ../gradio-ui
cd data-on-eks/gen-ai/inference/gradio-ui
docker build --platform=linux/amd64 \
-t gradio-app:llama \
--build-arg GRADIO_APP="gradio-app-llama.py" \
@@ -298,14 +298,14 @@ docker rmi gradio-app:llama
**Step2:** Delete Ray Cluster

```bash
cd ../llama3-8b-instruct-rayserve-inf2
cd data-on-eks/gen-ai/inference/llama3-8b-instruct-rayserve-inf2
kubectl delete -f ray-service-llama3.yaml
```

**Step3:** Cleanup the EKS Cluster
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
8 changes: 4 additions & 4 deletions website/docs/gen-ai/inference/Neuron/rayserve-ha.md
@@ -66,7 +66,7 @@ export TF_VAR_enable_rayserve_ha_elastic_cache_redis=true
Then, run the `install.sh` script to install the EKS cluster with KubeRay operator and other add-ons.

```bash
cd ai-ml/trainimum-inferentia
cd data-on-eks/ai-ml/trainimum-inferentia
./install.sh
```
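
After the script finishes, it is worth confirming that the KubeRay operator is running before deploying the RayService. The cluster name matches the other Neuron blueprints in this repository, and the operator namespace is an assumption based on common add-on defaults; adjust both if your installation differs:

```bash
aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
kubectl get pods -n kuberay-operator
```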

@@ -135,7 +135,7 @@ With the above `RayService` configuration, we have enabled GCS fault tolerance f
Let's apply the above `RayService` configuration and check the behavior.

```bash
cd ../../gen-ai/inference/
cd data-on-eks/gen-ai/inference/
envsubst < mistral-7b-rayserve-inf2/ray-service-mistral-ft.yaml| kubectl apply -f -
```
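
With GCS fault tolerance backed by ElastiCache Redis, the Serve application should survive a restart of the Ray head pod. A minimal sketch of how to exercise this, assuming the RayService runs in a `mistral` namespace (an assumption; use the namespace from the manifest):

```bash
# Delete the Ray head pod and watch the RayService recover without redeploying
kubectl get pods -n mistral -l ray.io/node-type=head
kubectl delete pod -n mistral -l ray.io/node-type=head
kubectl get rayservice -n mistral -w
```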

@@ -202,7 +202,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou
**Step1:** Delete Gradio App and mistral Inference deployment

```bash
cd gen-ai/inference/mistral-7b-rayserve-inf2
cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2
kubectl delete -f gradio-ui.yaml
kubectl delete -f ray-service-mistral-ft.yaml
```
@@ -211,6 +211,6 @@ kubectl delete -f ray-service-mistral-ft.yaml
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
10 changes: 5 additions & 5 deletions website/docs/gen-ai/inference/Neuron/stablediffusion-inf2.md
@@ -14,7 +14,7 @@ This example blueprint deploys a `stable-diffusion-xl-base-1-0` model on Inferen

:::

# Deploying Stable Diffusion XL Base Model with Inferentia, Ray Serve and Gradio
# Serving Stable Diffusion XL Base Model with Inferentia, Ray Serve and Gradio
Welcome to the comprehensive guide on deploying the [Stable Diffusion XL Base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).
In this tutorial, you will not only learn how to harness the power of Stable Diffusion models, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [trn1/inf2](https://aws.amazon.com/machine-learning/neuron/) (powered by AWS Trainium and Inferentia) instances, such as `inf2.24xlarge` and `inf2.48xlarge`,
which are optimized for deploying and scaling large language models.
@@ -135,7 +135,7 @@ aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia
**Deploy RayServe Cluster**

```bash
cd ../../gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2
cd data-on-eks/gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2
kubectl apply -f ray-service-stablediffusion.yaml
```

@@ -217,7 +217,7 @@ Let's move forward with setting up the Gradio app as a Docker container running
First, let's build the Docker container for the client app.

```bash
cd ../gradio-ui
cd data-on-eks/gen-ai/inference/gradio-ui
docker build --platform=linux/amd64 \
-t gradio-app:sd \
--build-arg GRADIO_APP="gradio-app-stable-diffusion.py" \
@@ -276,14 +276,14 @@ docker rmi gradio-app:sd
**Step2:** Delete Ray Cluster

```bash
cd ../stable-diffusion-xl-base-rayserve-inf2
cd data-on-eks/gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2
kubectl delete -f ray-service-stablediffusion.yaml
```

**Step3:** Cleanup the EKS Cluster
This script will clean up the environment using the `-target` option to ensure all the resources are deleted in the correct order.

```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
13 changes: 3 additions & 10 deletions website/docs/gen-ai/inference/Neuron/vllm-ray-inf2.md
@@ -163,7 +163,7 @@ Having deployed the EKS cluster with all the necessary components, we can now pr
This will apply the RayService configuration and deploy the cluster on your EKS setup.

```bash
cd ../../gen-ai/inference/vllm-rayserve-inf2
cd data-on-eks/gen-ai/inference/vllm-rayserve-inf2

kubectl apply -f vllm-rayserve-deployment.yaml
```
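
Model compilation on Inferentia can take several minutes, so it helps to watch the Ray head and worker pods in the `vllm` namespace (the namespace used by the port-forward later in this guide) until they are ready:

```bash
kubectl get pods -n vllm -w
```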
@@ -258,7 +258,7 @@ kubectl -n vllm port-forward svc/vllm-llama3-inf2-serve-svc 8000:8000
To run the Python client application in a virtual environment, follow these steps:

```bash
cd gen-ai/inference/vllm-rayserve-inf2
cd data-on-eks/gen-ai/inference/vllm-rayserve-inf2
python3 -m venv .venv
source .venv/bin/activate
pip3 install openai
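# Optional: with the port-forward from the previous step active (localhost:8000),
# check whether the OpenAI-compatible endpoint responds (available routes may vary)
curl -s http://localhost:8000/v1/models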
@@ -588,13 +588,6 @@ Each of these files contain the following Performance Benchmarking Metrics:
```results_number_output_tokens_*```: Number of output tokens in the requests (Output length)
## Cleanup
To remove all resources created by this deployment, run:
```bash
./cleanup.sh
```
## Conclusion
In summary, when it comes to deploying and scaling Llama-3, AWS Trn1/Inf2 instances offer a compelling advantage.
@@ -615,7 +608,7 @@ kubectl delete -f vllm-rayserve-deployment.yaml
Destroy the EKS Cluster and resources
```bash
cd ../../../ai-ml/trainium-inferentia/
cd data-on-eks/ai-ml/trainium-inferentia/
./cleanup.sh
```
10 changes: 5 additions & 5 deletions website/docs/resources/binpacking-custom-scheduler-eks.md
@@ -10,9 +10,9 @@ sidebar_label: Bin packing for Amazon EKS
In this post, we will show you how to enable a custom scheduler with Amazon EKS when running DoEKS, especially for Spark on EKS, including OSS Spark and EMR on EKS. The custom scheduler is a custom Kubernetes scheduler with the ```MostAllocated``` strategy running in the data plane.

### Why bin packing
By default, the [scheduling-plugin](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins) NodeResourcesFit uses ```LeastAllocated``` as its score strategy. For long-running workloads, that is good because it favors high availability. But for batch jobs, like Spark workloads, this can lead to higher cost. Changing the strategy from ```LeastAllocated``` to ```MostAllocated``` avoids spreading pods across all running nodes, leading to higher resource utilization and better cost efficiency.
By default, the [scheduling-plugin](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins) NodeResourcesFit uses ```LeastAllocated``` as its score strategy. For long-running workloads, that is good because it favors high availability. But for batch jobs, like Spark workloads, this can lead to higher cost. Changing the strategy from ```LeastAllocated``` to ```MostAllocated``` avoids spreading pods across all running nodes, leading to higher resource utilization and better cost efficiency.

Batch jobs like Spark run on demand for a limited or predictable amount of time. With the ```MostAllocated``` strategy, Spark executors are bin packed onto one node until that node cannot host any more pods. The following picture shows the
Batch jobs like Spark run on demand for a limited or predictable amount of time. With the ```MostAllocated``` strategy, Spark executors are bin packed onto one node until that node cannot host any more pods. The following picture shows the

```MostAllocated``` strategy in EMR on EKS.
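
For reference, the bin-packing behaviour described above comes down to the `NodeResourcesFit` scoring strategy in the scheduler configuration. The snippet below is a minimal, assumed sketch of such a configuration (the scheduler name and resource weights are placeholders, not the exact values shipped by this blueprint):

```bash
# Write a minimal KubeSchedulerConfiguration that scores nodes with MostAllocated
cat > bin-packing-scheduler-config.yaml <<'EOF'
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
EOF
```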

@@ -71,12 +71,12 @@ spec:
volumes:
- name: spark-local-dir-1
hostPath:
path: /local1
initContainers:
path: /local1
initContainers:
- name: volume-permission
image: public.ecr.aws/docker/library/busybox
# grant volume access to hadoop user
command: ['sh', '-c', 'if [ ! -d /data1 ]; then mkdir /data1;fi; chown -R 999:1000 /data1']
command: ['sh', '-c', 'if [ ! -d /data1 ]; then mkdir /data1;fi; chown -R 999:1000 /data1']
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1