Description
Hello, I’m trying to fine-tune Llama 3.2 11B Vision Instruct to take an image as input and output text and a number.
I have been following the process documented by the Unsloth notebook:
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb
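For context, here is a trimmed-down sketch of what my training.py does. It mirrors the notebook's flow; the hyperparameter values below are the notebook's defaults rather than a verbatim copy of my script:

```python
# Rough sketch of training.py, following the Unsloth vision notebook;
# values are the notebook defaults, not necessarily my exact script.
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
)

FastVisionModel.for_training(model)  # switch into training mode

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,  # my 10 image -> (text + number) examples
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=30,
        learning_rate=2e-4,
        optim="adamw_8bit",
        # the notebook sets these for vision fine-tuning:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()  # <-- this is the line that fails
```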
The error is raised by the following line of the attached training.py file:
trainer_stats = trainer.train()
The full output is shown below. The error is cryptic, and searching for it did not turn up much help.
Going To Create the Trainer
Created the trainer
GPU = NVIDIA GeForce RTX 4080 SUPER. Max memory = 15.992 GB.
8.525 GB of memory reserved.
Shown current memory stats
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10 | Num Epochs = 30 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 67,174,400/11,000,000,000 (0.61% trained)
0%| | 0/30 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Unsloth: Will smartly offload gradients to save VRAM!
Traceback (most recent call last):
File "/home/ananya/AnanyaIR/training.py", line 185, in
File "/home/ananya/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2245, in train
File "", line 315, in _fast_inner_training_loop
File "", line 77, in _unsloth_training_step
File "/home/ananya/.local/lib/python3.12/site-packages/accelerate/accelerator.py", line 2454, in backward
File "/home/ananya/.local/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
File "/home/ananya/.local/lib/python3.12/site-packages/torch/autograd/init.py", line 347, in backward
File "/home/ananya/.local/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
File "/home/ananya/.local/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
File "/home/ananya/.local/lib/python3.12/site-packages/unsloth_zoo/gradient_checkpointing.py", line 554, in backward
File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
File "/home/ananya/.local/lib/python3.12/site-packages/transformers/models/mllama/modeling_mllama.py", line 960, in forward
File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
File "/home/ananya/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
File "/tmp/unsloth_compiled_cache/unsloth_compiled_module_mllama.py", line 568, in forward
File "/tmp/unsloth_compiled_cache/unsloth_compiled_module_mllama.py", line 535, in MllamaTextCrossSdpaAttention_forward
RuntimeError: CUDA driver error: unknown error
0%| | 0/30 [00:15<?, ?it/s]
If anyone knows what the cause of this error might be, I’d really appreciate the help. Thank you.
PS:
Some sources describing similar unknown CUDA errors pointed to a possible out-of-memory issue, so I tried setting
gpu_memory_utilization = 0.6
in the FastVisionModel.from_pretrained call. That only produced a different error:
TypeError: MllamaForConditionalGeneration.__init__() got an unexpected keyword argument 'gpu_memory_utilization'
So it looks like that parameter cannot be set here.
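For reference, this is roughly the call that raised the TypeError; gpu_memory_utilization was the only change from the notebook's loading code:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
    gpu_memory_utilization=0.6,  # rejected: unexpected keyword argument
)
```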