
Ray Serve Mistral LLM GPU Deploy Worker fails readiness check due to /home/ray/anaconda3/lib/libtinfo.so.6: no version information available (required by bash) #661

Open · calvinraveenthran opened this issue Sep 20, 2024 · 6 comments

@calvinraveenthran commented Sep 20, 2024

Description

I am following this doc: https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-rayserve

Once I run

cd data-on-eks/gen-ai/inference/vllm-rayserve-gpu

envsubst < ray-service-vllm.yaml | kubectl apply -f -

I notice that the GPU worker pod continually fails the readiness check.

Here are the events:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  12m                    default-scheduler  Successfully assigned rayserve-vllm/vllm-raycluster-nkcgg-worker-gpu-group-rwlzt to ip-100-64-58-220.us-west-2.compute.internal
  Normal   Pulled     12m                    kubelet            Container image "public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v3" already present on machine
  Normal   Created    12m                    kubelet            Created container wait-gcs-ready
  Normal   Started    12m                    kubelet            Started container wait-gcs-ready
  Normal   Pulled     12m                    kubelet            Container image "public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v3" already present on machine
  Normal   Created    12m                    kubelet            Created container ray-worker
  Normal   Started    12m                    kubelet            Started container ray-worker
  Warning  Unhealthy  2m41s (x123 over 12m)  kubelet            Readiness probe failed: success
bash: /home/ray/anaconda3/lib/libtinfo.so.6: no version information available (required by bash)
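
For reference, here is a sketch of how the probe can be inspected by hand (the pod and container names come from the events above; the exact probe command KubeRay injects is version-dependent):

# show the readiness probe defined on the ray-worker container
kubectl get pod vllm-raycluster-nkcgg-worker-gpu-group-rwlzt -n rayserve-vllm \
  -o jsonpath='{.spec.containers[?(@.name=="ray-worker")].readinessProbe}'

# open a shell in the container and re-run that probe command manually to separate stdout from stderr;
# the libtinfo.so.6 message is a warning bash prints to stderr, not necessarily the reason the probe fails
kubectl exec -it vllm-raycluster-nkcgg-worker-gpu-group-rwlzt -n rayserve-vllm -c ray-worker -- bash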

If your request is for a new feature, please use the Feature request template.

  • ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Before you submit an issue, please perform the following for Terraform examples (the commands are also collected as a single run below this list):

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists
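
The same three steps as a single run (a sketch; only remove .terraform/ when your state is stored remotely):

rm -rf .terraform/   # drop the locally cached modules and providers (remote state only)
terraform init       # re-initialize the project root and pull modules again
terraform plan       # or terraform apply, and check whether the issue persists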

Versions

  • Module version [Required]:

  • Terraform version:

  • Provider version(s):

Reproduction Code [Required]

Steps to reproduce the behavior:

  1. Follow each step on here: https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-rayserve

Expected behavior

Worker pod should be running.

Actual behavior

Worker pod fails readiness check.

Terminal Output Screenshot(s)

Additional context

@vara-bonthu (Collaborator)

@shivam-dubey-1 @ratnopamc

@shivam-dubey-1 (Contributor)

Hi @calvinraveenthran,
I deployed the stack and couldn't find any error.

kubectl get pods -n rayserve-vllm
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-raycluster-mt8w2-head-rwxsb               2/2     Running   0          12m
vllm-raycluster-mt8w2-worker-gpu-group-cddjr   1/1     Running   0          12m

Can you please share the KubeRay and Ray service logs? You can use the following command to get more details:
kubectl describe rayservice vllm -n rayserve-vllm
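
For example (the KubeRay operator namespace and deployment name below are assumptions that may differ in your install; <worker-pod-name> is a placeholder):

# KubeRay operator logs
kubectl logs -n kuberay-operator deploy/kuberay-operator --tail=200

# logs and events for the failing Ray worker pod
kubectl logs -n rayserve-vllm <worker-pod-name> -c ray-worker --tail=200
kubectl describe pod -n rayserve-vllm <worker-pod-name>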

@radudobrinescu

Hi @calvinraveenthran

Mistral-7B-Instruct-v0.2 is a gated model. Can you confirm you have been granted access to the model on HF?
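
One quick way to check from the same environment is a sketch like the one below (replace $HUGGING_FACE_HUB_TOKEN with however your token is exposed; the model id assumes the standard mistralai/Mistral-7B-Instruct-v0.2). A 200 response means the token has been granted access; 401/403 typically means it has not.

curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $HUGGING_FACE_HUB_TOKEN" \
  https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.2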

@calvinraveenthran (Author)

Hi @radudobrinescu

I do have access to the model in HF.

@shivam-dubey-1 I will test this today and get back to you.

@calvinraveenthran (Author)

When I retry the build I get another error:

│ Error: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "x86-cpu-karpenter" namespace: "" from "": no matches for kind "EC2NodeClass" in version "karpenter.k8s.aws/v1"
│ ensure CRDs are installed first, resource mapping not found for name: "x86-cpu-karpenter" namespace: "" from "": no matches for kind "NodePool" in version "karpenter.sh/v1"
│ ensure CRDs are installed first]
│ 
│   with module.data_addons.helm_release.karpenter_resources["x86-cpu-karpenter"],
│   on .terraform/modules/data_addons/karpenter-resources.tf line 6, in resource "helm_release" "karpenter_resources":
│    6: resource "helm_release" "karpenter_resources" {
│ 
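
For reference, the CRDs behind those kinds can be checked directly (the CRD names below correspond to the NodePool and EC2NodeClass kinds named in the error):

# are the Karpenter CRDs installed at all?
kubectl get crd nodepools.karpenter.sh ec2nodeclasses.karpenter.k8s.aws

# which API versions do they serve? (the error expects karpenter.sh/v1 and karpenter.k8s.aws/v1)
kubectl get crd nodepools.karpenter.sh -o jsonpath='{.spec.versions[*].name}'
kubectl get crd ec2nodeclasses.karpenter.k8s.aws -o jsonpath='{.spec.versions[*].name}'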

@milosjovanov

> When I retry the build I get another error: no matches for kind "EC2NodeClass" in version "karpenter.k8s.aws/v1" ... no matches for kind "NodePool" in version "karpenter.sh/v1" - ensure CRDs are installed first
Look at #669: pin the data_addons module to 1.33.0.
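
A sketch of applying that pin (file locations vary by blueprint; the grep only locates where the data_addons module is declared):

# find the module block and set its version argument to 1.33.0
grep -rn --include="*.tf" "data_addons" .

# after editing the version, refresh the cached modules and retry
terraform init -upgrade
terraform apply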
