Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading utanh_bf16 #850

Open
nikolaydubina opened this issue Oct 14, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@nikolaydubina
Contributor

Describe the bug

Llama 3.2 11B Vision fails to start after the model is loaded:

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading utanh_bf16

My system:

DRIVER_VERSION=550.90.07

Latest commit or version

ghcr.io/ericlbuehler/mistral.rs:cuda-90-sha-1caf83a@sha256:095518a16d1f0a9fa2e212463736ccb540eeb0f88f21c10a2123ab8cf481b83e

References

@nikolaydubina nikolaydubina added the bug Something isn't working label Oct 14, 2024
@nikolaydubina
Contributor Author

nikolaydubina commented Oct 14, 2024

With DRIVER_VERSION=535.183.01 (the default in GKE), the error is:

Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading utanh_bf16

The Docker image declares a CUDA requirement whose driver ranges do not go as far as 550; the highest range ends at driver<536:

ENV NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536

This driver uses CUDA 12.2.2:
https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-104-12/index.html

@nikolaydubina
Contributor Author

nikolaydubina commented Oct 14, 2024

I have a 24 GB NVIDIA L4. It fails with the 3B model as well, without reaching full memory, so this is not an out-of-memory issue.


The 3B model hits the same error while loading a different kernel (cast_u32_f32 instead of utanh_bf16):

Error: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") when loading cast_u32_f32

@nikolaydubina
Contributor Author

nikolaydubina commented Oct 16, 2024

OK, so CUDA works on the nodes. Something is wrong with the CUDA usage or build in mistral.rs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   41C    P8             17W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                          
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

A sample CUDA Pod works too:

$ kubectl -n ml logs vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Pod manifests:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-info
  namespace: ml
spec:
  restartPolicy: OnFailure
  containers:
    - name: main
      image: cuda:12.4.1-cudnn-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
  namespace: ml
spec:
  restartPolicy: OnFailure
  containers:
    - name: main
      image: cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1

@nikolaydubina
Contributor Author

nikolaydubina commented Oct 17, 2024

PyTorch also works on these nodes.

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-cuda
  namespace: ml
spec:
  containers:
    - name: main
      image: pytorch/pytorch:2.4.1-cuda12.4-cudnn9-devel
      command: ["/bin/sh", "-c", "sleep 1000000"]
      resources:
        limits:
          nvidia.com/gpu: 1

$ kubectl exec -n ml --stdin --tty pytorch-cuda -- /bin/bash
root@pytorch-cuda:/workspace# python3
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
0
>>> torch.cuda.device_count() 
1
>>> torch.cuda.get_device_name(0)
'NVIDIA L4'
>>> 
root@pytorch-cuda:/workspace# 

A Hugging Face PyTorch-based HTTP server Docker image also works here and uses CUDA: https://github.com/nikolaydubina/basic-openai-pytorch-server
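
One thing these checks do not cover is which architecture the mistral.rs kernels were built for. As a hedged sketch: CUDA can JIT-compile PTX only for devices whose compute capability is equal to or greater than the PTX `.target`. The L4 is a compute capability 8.9 device; whether the `cuda-90` in the image tag means the kernels target 9.0 is an assumption that nothing in this thread confirms. `parse_target` and `can_jit` are hypothetical helpers, and the sketch ignores architecture-specific targets with a letter suffix.

```python
import re


def parse_target(target):
    """Turn a PTX target such as 'sm_89' into a (major, minor) tuple."""
    match = re.fullmatch(r"sm_(\d+)(\d)", target)
    if match is None:
        raise ValueError(f"unsupported target string: {target!r}")
    return int(match.group(1)), int(match.group(2))


def can_jit(ptx_target, device_capability):
    """PTX can be JIT-compiled only for an equal or newer compute capability."""
    return parse_target(ptx_target) <= device_capability


# Compute capability of the NVIDIA L4.
L4_CAPABILITY = (8, 9)

# 'sm_90' is the assumed target of the cuda-90 image, not a confirmed value.
for target in ("sm_80", "sm_89", "sm_90"):
    print(target, can_jit(target, L4_CAPABILITY))
```

In the PyTorch session above, `torch.cuda.get_device_capability(0)` would supply the device tuple instead of the hard-coded `L4_CAPABILITY`.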

@nikolaydubina
Contributor Author

@EricLBuehler any tips on why CUDA is not working here?

@vasileermicioi

I tested the Python cookbook from the examples on Google Colab and got the same or a similar error:

<ipython-input-3-cf1346fa2968> in <cell line: 3>()
      1 from mistralrs import Runner, Which, ChatCompletionRequest
      2 
----> 3 runner = Runner(
      4     which=Which.GGUF(
      5         tok_model_id="microsoft/Phi-3.5-mini-instruct",

ValueError: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading cast_u32_f32
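
`CUDA_ERROR_UNSUPPORTED_PTX_VERSION` is documented as most commonly meaning that the PTX was generated by a compiler newer than the driver's JIT compiler supports. A minimal sketch for checking that, assuming the PTX text of a kernel is available: it reads the `.version` directive at the top of a PTX module and compares it with the highest ISA version the driver accepts. The sample header and the `max_isa` values below are illustrative assumptions; the PTX ISA release notes list the version that matches a given CUDA release.

```python
import re


def ptx_isa_version(ptx_text):
    """Return the (major, minor) PTX ISA version from the .version directive."""
    match = re.search(r"^\s*\.version\s+(\d+)\.(\d+)", ptx_text, re.MULTILINE)
    if match is None:
        raise ValueError("no .version directive found")
    return int(match.group(1)), int(match.group(2))


def driver_accepts(ptx_text, max_isa):
    """True if the module's ISA version is not newer than the driver supports."""
    return ptx_isa_version(ptx_text) <= max_isa


# Hypothetical PTX header, not extracted from the mistral.rs image.
SAMPLE_PTX = """\
//
// Generated by NVIDIA NVVM Compiler
//
.version 8.4
.target sm_89
.address_size 64
"""

# max_isa values are assumptions chosen to show both outcomes.
print(ptx_isa_version(SAMPLE_PTX))
print(driver_accepts(SAMPLE_PTX, (8, 2)), driver_accepts(SAMPLE_PTX, (8, 4)))
```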
