Description
Hi, I'm a beginner and I've run into an issue when trying to save a model to GGUF.
Initially, I received the error message: "Unsloth: The file ('llama.cpp/llama-quantize' or 'llama.cpp/llama-quantize.exe' if you are on Windows WSL) or 'llama.cpp/quantize' does not exist."
Following the instructions at https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cpu-build, I successfully built llama.cpp and moved "llama-quantize.exe" into ./llama.cpp (#748); the commands I used are sketched below.
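For reference, this is roughly what I ran, per the linked CPU-build docs; the build-output path is the standard MSVC Release layout on my machine, so treat the exact paths as approximate:

```
cd llama.cpp
cmake -B build
cmake --build build --config Release
:: put the quantizer where Unsloth looks for it (llama.cpp/llama-quantize.exe)
copy build\bin\Release\llama-quantize.exe .
```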
I then ran the save again (this is line 36 in my test_lora.py):

```python
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```
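For completeness, the model and tokenizer come from a load roughly like this (the model name is a placeholder for my local LoRA checkpoint, and max_seq_length is just what I trained with):

```python
from unsloth import FastLanguageModel

# Placeholder path: in my script this points at my saved LoRA checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    load_in_4bit = True,
)
# ...line 36 then calls model.save_pretrained_gguf as shown above.
```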
However, while the BF16 export succeeded, the q4_k_m conversion step failed with a new error:
```
Unsloth: Conversion completed! Output location: C:\Code\unsloth\model\unsloth.BF16.gguf
Unsloth: [2] Converting GGUF 16bit into q4_k_m. This might take 20 minutes...
'.' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "C:\Code\unsloth\test_lora.py", line 36, in <module>
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
  File "C:\Users\User\anaconda3\envs\unsloth_env\Lib\site-packages\unsloth\save.py", line 1748, in unsloth_save_pretrained_gguf
    all_file_locations, want_full_precision = save_to_gguf(
                                              ^^^^^^^^^^^^^
  File "C:\Users\User\anaconda3\envs\unsloth_env\Lib\site-packages\unsloth\save.py", line 1251, in save_to_gguf
    raise RuntimeError(
RuntimeError: Unsloth: Quantization failed! You might have to compile llama.cpp yourself, then run this again.
You do not need to close this Python program. Run the following commands in a new terminal:
You must run this in the same folder as you're saving your model.
git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && make all -j
Once that's done, redo the quantization.
```
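The "'.' is not recognized" line makes me suspect the quantize step is being launched as a POSIX-style "./llama-quantize" command, which cmd.exe can't parse. For what it's worth, I believe the manual equivalent of that step would look roughly like this (untested sketch, run from C:\Code\unsloth):

```
:: output filename is my guess at what Unsloth would write
llama.cpp\llama-quantize.exe model\unsloth.BF16.gguf model\unsloth.Q4_K_M.gguf q4_k_m
```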
My environment:

```
Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
GPU: NVIDIA GeForce RTX 4080. Max memory: 15.992 GB. Platform: Windows.
Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0.
```
What should I do?
Thanks