
Ray Serve Mistral LLM GPU Deploy Worker fails readiness check due to /home/ray/anaconda3/lib/libtinfo.so.6: no version information available (required by bash) #661

Open · calvinraveenthran opened this issue Sep 20, 2024 · 6 comments

@calvinraveenthran commented Sep 20, 2024

Description

I am following this doc: https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-rayserve

Once I run

cd data-on-eks/gen-ai/inference/vllm-rayserve-gpu

envsubst < ray-service-vllm.yaml | kubectl apply -f -

I notice that the GPU worker pod continually fails the readiness check.

Here are the events:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  12m                    default-scheduler  Successfully assigned rayserve-vllm/vllm-raycluster-nkcgg-worker-gpu-group-rwlzt to ip-100-64-58-220.us-west-2.compute.internal
  Normal   Pulled     12m                    kubelet            Container image "public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v3" already present on machine
  Normal   Created    12m                    kubelet            Created container wait-gcs-ready
  Normal   Started    12m                    kubelet            Started container wait-gcs-ready
  Normal   Pulled     12m                    kubelet            Container image "public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v3" already present on machine
  Normal   Created    12m                    kubelet            Created container ray-worker
  Normal   Started    12m                    kubelet            Started container ray-worker
  Warning  Unhealthy  2m41s (x123 over 12m)  kubelet            Readiness probe failed: success
bash: /home/ray/anaconda3/lib/libtinfo.so.6: no version information available (required by bash)
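
For reference, here is a sketch of how the probe can be inspected by hand (the pod and container names come from the events above; the exact probe command KubeRay injects is version-dependent):

# show the readiness probe defined on the ray-worker container
kubectl get pod vllm-raycluster-nkcgg-worker-gpu-group-rwlzt -n rayserve-vllm \
  -o jsonpath='{.spec.containers[?(@.name=="ray-worker")].readinessProbe}'

# open a shell in the container and re-run that probe command manually to separate stdout from stderr;
# the libtinfo.so.6 message is a warning bash prints to stderr, not necessarily the reason the probe fails
kubectl exec -it vllm-raycluster-nkcgg-worker-gpu-group-rwlzt -n rayserve-vllm -c ray-worker -- bash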

If your request is for a new feature, please use the Feature request template.

  • ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Before you submit an issue, please perform the following for Terraform examples (the commands are also collected as a single run below this list):

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists
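
The same three steps as a single run (a sketch; only remove .terraform/ when your state is stored remotely):

rm -rf .terraform/   # drop the locally cached modules and providers (remote state only)
terraform init       # re-initialize the project root and pull modules again
terraform plan       # or terraform apply, and check whether the issue persists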

Versions

  • Module version [Required]:

  • Terraform version:

  • Provider version(s):

Reproduction Code [Required]

Steps to reproduce the behavior:

  1. Follow each step on here: https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-rayserve

Expected behavior

Worker pod should be running.

Actual behavior

Worker pod fails readiness check.

Terminal Output Screenshot(s)

Additional context

@vara-bonthu (Collaborator)

@shivam-dubey-1 @ratnopamc

@shivam-dubey-1 (Contributor)

Hi @calvinraveenthran,
I deployed the stack and couldn't find any error.

kubectl get pods -n rayserve-vllm
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-raycluster-mt8w2-head-rwxsb               2/2     Running   0          12m
vllm-raycluster-mt8w2-worker-gpu-group-cddjr   1/1     Running   0          12m

Can you please share the KubeRay and Ray service logs? You can use the following command to get more details:
kubectl describe rayservice vllm -n rayserve-vllm
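
For example (the KubeRay operator namespace and deployment name below are assumptions that may differ in your install; <worker-pod-name> is a placeholder):

# KubeRay operator logs
kubectl logs -n kuberay-operator deploy/kuberay-operator --tail=200

# logs and events for the failing Ray worker pod
kubectl logs -n rayserve-vllm <worker-pod-name> -c ray-worker --tail=200
kubectl describe pod -n rayserve-vllm <worker-pod-name>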

@radudobrinescu

Hi @calvinraveenthran

Mistral-7B-Instruct-v0.2 is a gated model. Can you confirm you have been granted access to the model on HF?
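
One quick way to check from the same environment is a sketch like the one below (replace $HUGGING_FACE_HUB_TOKEN with however your token is exposed; the model id assumes the standard mistralai/Mistral-7B-Instruct-v0.2). A 200 response means the token has been granted access; 401/403 typically means it has not.

curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $HUGGING_FACE_HUB_TOKEN" \
  https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.2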

@calvinraveenthran (Author)

Hi @radudobrinescu

I do have access to the model in HF.

@shivam-dubey-1 I will test this today and get back to you.

@calvinraveenthran (Author)

When I retry the build I get another error:

│ Error: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "x86-cpu-karpenter" namespace: "" from "": no matches for kind "EC2NodeClass" in version "karpenter.k8s.aws/v1"
│ ensure CRDs are installed first, resource mapping not found for name: "x86-cpu-karpenter" namespace: "" from "": no matches for kind "NodePool" in version "karpenter.sh/v1"
│ ensure CRDs are installed first]
│ 
│   with module.data_addons.helm_release.karpenter_resources["x86-cpu-karpenter"],
│   on .terraform/modules/data_addons/karpenter-resources.tf line 6, in resource "helm_release" "karpenter_resources":
│    6: resource "helm_release" "karpenter_resources" {
│ 
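
For reference, the CRDs behind those kinds can be checked directly (the CRD names below correspond to the NodePool and EC2NodeClass kinds named in the error):

# are the Karpenter CRDs installed at all?
kubectl get crd nodepools.karpenter.sh ec2nodeclasses.karpenter.k8s.aws

# which API versions do they serve? (the error expects karpenter.sh/v1 and karpenter.k8s.aws/v1)
kubectl get crd nodepools.karpenter.sh -o jsonpath='{.spec.versions[*].name}'
kubectl get crd ec2nodeclasses.karpenter.k8s.aws -o jsonpath='{.spec.versions[*].name}'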

@milosjovanov

> When I retry the build I get another error: no matches for kind "EC2NodeClass" in version "karpenter.k8s.aws/v1" ... no matches for kind "NodePool" in version "karpenter.sh/v1" - ensure CRDs are installed first
Look at #669: pin the data_addons module to 1.33.0.
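
A sketch of applying that pin (file locations vary by blueprint; the grep only locates where the data_addons module is declared):

# find the module block and set its version argument to 1.33.0
grep -rn --include="*.tf" "data_addons" .

# after editing the version, refresh the cached modules and retry
terraform init -upgrade
terraform apply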
