I used the config `recipes/configs/mistral/7B_full_low_memory.yaml`. Its default optimizer is `bitsandbytes.optim.PagedAdamW`, which raises the following error:
```
[rank6]: Traceback (most recent call last):
[rank6]:   File "/sensei-fs/users/my_name/code/finetune/finetune_distributed.py", line 776, in <module>
[rank6]:     sys.exit(recipe_main())
[rank6]:   File "/sensei-fs/users/my_name/code/finetune/torchtune/torchtune/config/_parse.py", line 99, in wrapper
[rank6]:     sys.exit(recipe_main(conf))
[rank6]:   File "/sensei-fs/users/my_name/code/finetune/finetune_distributed.py", line 771, in recipe_main
[rank6]:     recipe.train()
[rank6]:   File "/sensei-fs/users/my_name/code/finetune/finetune_distributed.py", line 680, in train
[rank6]:     self._optimizer.step()
[rank6]:   File "/opt/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank6]:     out = func(*args, **kwargs)
[rank6]:   File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]:     return func(*args, **kwargs)
[rank6]:   File "/opt/venv/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 292, in step
[rank6]:     torch.cuda.synchronize()
[rank6]:   File "/opt/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 892, in synchronize
[rank6]:     return torch._C._cuda_synchronize()
[rank6]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank6]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
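As a workaround while debugging, one option is to override the optimizer in the config with the stock PyTorch AdamW, which rules out the bitsandbytes paged-memory path that calls `torch.cuda.synchronize()`. This is a sketch of what the override might look like; the exact keys and values in `7B_full_low_memory.yaml` may differ, so verify against the actual file:

```yaml
# Hypothetical override: swap the paged bitsandbytes optimizer for
# torch.optim.AdamW to rule out bitsandbytes' paged-memory path.
# Field names follow torchtune's usual optimizer block; the lr value
# here is an assumption, not taken from the original config.
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  fused: True
```

Setting `CUDA_LAUNCH_BLOCKING=1` when relaunching, as the error message itself suggests, should also make the traceback point at the kernel that actually faulted rather than the later synchronize call.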