
"RuntimeError: CUDA driver error: unknown error" when Fine Tuning Llama-3.2-11B-Vision-Instruct #2408

Open
@ananya-kumbhare

Description


Hello, I’m trying to fine-tune Llama 3.2 11B Vision Instruct to take inputs of an image and output text and a number.
I have been following the process documented by the Unsloth notebook:
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb

I am getting the following error from this line of the attached training.py:

```python
trainer_stats = trainer.train()
```

The error (shown below) is cryptic, and searching for it did not turn up much:


```
Going To Create the Trainer
Created the trainer
GPU = NVIDIA GeForce RTX 4080 SUPER. Max memory = 15.992 GB.
8.525 GB of memory reserved.
Shown current memory stats
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10 | Num Epochs = 30 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 67,174,400/11,000,000,000 (0.61% trained)
  0%|          | 0/30 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
Unsloth: Will smartly offload gradients to save VRAM!
Traceback (most recent call last):
  File "/home/ananya/AnanyaIR/training.py", line 185, in <module>
  File "/home/ananya/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2245, in train
  File "<string>", line 315, in _fast_inner_training_loop
  File "<string>", line 77, in _unsloth_training_step
  File "/home/ananya/.local/lib/python3.12/site-packages/accelerate/accelerator.py", line 2454, in backward
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
  File "/home/ananya/.local/lib/python3.12/site-packages/unsloth_zoo/gradient_checkpointing.py", line 554, in backward
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
  File "/home/ananya/.local/lib/python3.12/site-packages/transformers/models/mllama/modeling_mllama.py", line 960, in forward
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
  File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
  File "/tmp/unsloth_compiled_cache/unsloth_compiled_module_mllama.py", line 568, in forward
  File "/tmp/unsloth_compiled_cache/unsloth_compiled_module_mllama.py", line 535, in MllamaTextCrossSdpaAttention_forward
RuntimeError: CUDA driver error: unknown error
  0%|          | 0/30 [00:15<?, ?it/s]
```
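Since the failure happens inside `loss.backward()`, a minimal check independent of Unsloth and Transformers might help with triage. The sketch below is my own (not from the notebook): it runs a tiny backward pass on the GPU, so if it raises the same "CUDA driver error: unknown error", the problem is likely below the training stack (driver, kernel module, or hardware) rather than in the fine-tuning code.

```python
import torch

# Tiny forward/backward pass on the GPU. If this also raises
# "CUDA driver error: unknown error", the problem is below
# Unsloth/Transformers (driver, kernel module, or hardware).
if torch.cuda.is_available():
    x = torch.randn(256, 256, device="cuda", requires_grad=True)
    loss = (x @ x).sum()
    loss.backward()
    torch.cuda.synchronize()  # surface any pending async CUDA error
    print("backward OK, grad shape:", tuple(x.grad.shape))
else:
    print("torch cannot see a CUDA device at all")
```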


If anyone knows what the cause of this error might be, I’d really appreciate the help. Thank you.

PS:
Some reports of similar "unknown error" messages pointed to a possible out-of-memory issue, so I tried setting `gpu_memory_utilization = 0.6` in the `FastVisionModel.from_pretrained` call, but that raised another error:

```
TypeError: MllamaForConditionalGeneration.__init__() got an unexpected keyword argument 'gpu_memory_utilization'
```

so it seems that parameter cannot be passed here.
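From what I can tell, `gpu_memory_utilization` is a vLLM-style argument, which would explain why it is rejected when loading this model for training. As a workaround (this is my own assumption, not something from the Unsloth docs), PyTorch's CUDA caching allocator can be capped directly instead:

```python
import torch

# Cap PyTorch's CUDA caching allocator at ~60% of the card's memory,
# instead of passing gpu_memory_utilization to from_pretrained.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.6, device=0)

# Peak VRAM can also be lowered by shrinking the per-step batch while
# keeping the same effective batch size of 8, e.g. in the trainer config:
#   per_device_train_batch_size = 1,
#   gradient_accumulation_steps = 8,
```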

training.txt
