docs: add fine-tuning example (#1374)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-05-28 06:25:00 +00:00 · 2023-12-01 19:11:45 +01:00 · 2b2007ae9e
commit 2b2007ae9e
parent e94a34be8c
7 changed files with 1969 additions and 15 deletions
--- a/examples/README.md
+++ b/examples/README.md
@ -41,6 +41,14 @@ This example show how to use LocalAI inside Kubernetes with [k8sgpt](https://k8s

 ![Screenshot from 2023-06-19 23-58-47](https://github.com/go-skynet/go-ggml-transformers.cpp/assets/2420543/cab87409-ee68-44ae-8d53-41627fb49509)

+### Fine-tuning a model and converting it to gguf for use with LocalAI
+
+_by [@mudler](https://github.com/mudler)_
+
+This is an end-to-end example of how to fine-tune a model with [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) and convert it to gguf for use with LocalAI.
+
+[Check it out here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/)
+
 ### Flowise

 _by [@mudler](https://github.com/mudler)_
--- a/examples/e2e-fine-tuning/README.md
+++ b/examples/e2e-fine-tuning/README.md
@@ -0,0 +1,83 @@
+This is an example of fine-tuning an LLM for use with [LocalAI](https://github.com/mudler/LocalAI), written by [@mudler](https://github.com/mudler).
+
+Specifically, this example shows how to use [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to fine-tune an LLM and consume the result with LocalAI as a `gguf` model.
+
+A notebook is provided that currently works on _very small_ datasets on the free Google Colab instance. It is far from producing good models, but it gives a sense of how to adapt the code to a better dataset and configuration, and how to use the resulting model with LocalAI.
+
+## Requirements
+
+For this example you will need a Linux box and a GPU with at least 12GB of VRAM.
+The notebook is tested on Google Colab with a Tesla T4 GPU.
+
+## Clone this directory
+
+Clone the repository and enter the example directory:
+
+```bash
+git clone https://github.com/mudler/LocalAI
+cd LocalAI/examples/e2e-fine-tuning
+```
+
+## Install dependencies
+
+```bash
+# Install axolotl and dependencies
+git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
+pip install packaging
+pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd
+
+# https://github.com/oobabooga/text-generation-webui/issues/4238
+pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+```
+
+Configure accelerate:
+
+```bash
+accelerate config default
+```
+
+## Fine-tuning
+
+We will need to configure axolotl. This example provides an `axolotl.yaml` file that uses openllama-3b for fine-tuning. Copy the `axolotl.yaml` file and edit it to your needs. The dataset needs to sit next to it as `dataset.json`. The format used is `completion`, which is a list of JSON objects, each with a `text` field containing the full text to train the LLM on.
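The `completion` format described above can be sketched as follows. This is a minimal illustration and not part of the commit: the helper name `write_completion_dataset` and the sample strings are hypothetical, and it assumes, as the README states, that `dataset.json` is a JSON list of objects that each carry a single `text` field.

```python
import json
from pathlib import Path


def write_completion_dataset(texts, path="dataset.json"):
    """Write each string in `texts` as a {"text": ...} record in a JSON list."""
    records = [{"text": t} for t in texts]
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records


if __name__ == "__main__":
    # Placeholder samples (hypothetical); replace them with your own training text.
    samples = [
        "LocalAI is a drop-in replacement REST API for local inferencing.",
        "gguf is the model file format consumed by llama.cpp.",
    ]
    write_completion_dataset(samples)
```

Running the script from the example directory leaves a `dataset.json` next to `axolotl.yaml`, which is where the config's `datasets.path` entry looks for it.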
+
+If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:
+
+```bash
+# Optional pre-tokenize (run only if big dataset)
+python -m axolotl.cli.preprocess axolotl.yaml
+```
+
+Now we are ready to start the fine-tuning process:
+```bash
+# Fine-tune
+accelerate launch -m axolotl.cli.train axolotl.yaml
+```
+
+After the fine-tuning has finished, we merge the LoRA adapter into the base model:
+```bash
+# Merge lora
+python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False
+```
+
+And we convert it to the gguf format that LocalAI can consume:
+
+```bash
+
+# Convert to gguf
+git clone https://github.com/ggerganov/llama.cpp.git
+pushd llama.cpp && make LLAMA_CUBLAS=1 && popd
+
+# We need to convert the pytorch model into an f16 gguf file for quantization
+# It creates 'ggml-model-f16.gguf' in the 'merged' directory.
+pushd llama.cpp && python convert.py --outtype f16 \
+    ../qlora-out/merged/pytorch_model-00001-of-00002.bin && popd
+
+# Start off by making a basic q4_0 4-bit quantization.
+# It's important to have 'ggml' in the name of the quant for some
+# software to recognize its file format.
+pushd llama.cpp && ./quantize ../qlora-out/merged/ggml-model-f16.gguf \
+    ../custom-model-q4_0.bin q4_0 && popd
+
+```
+
+You should now have a `custom-model-q4_0.bin` file that you can copy into the LocalAI models directory and use with LocalAI.
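One way to check the converted model once it sits in the models directory is to send it a completion request. The sketch below rests on assumptions that are not in the commit: LocalAI is running locally on port 8080, it serves its OpenAI-compatible `/v1/completions` endpoint, and it exposes the file under its file name as the model name. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    model="custom-model-q4_0.bin",     # assumed: file name doubles as model name
    base_url="http://localhost:8080",  # assumed: default local LocalAI address
):
    """Build (but do not send) a POST request for a text completion."""
    payload = {"model": model, "prompt": prompt, "temperature": 0.7}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a LocalAI instance running, `print(complete("Once upon a time"))` should print the model's continuation. Adjust `model` and `base_url` if your setup differs.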
--- a/examples/e2e-fine-tuning/axolotl.yaml
+++ b/examples/e2e-fine-tuning/axolotl.yaml
@@ -0,0 +1,63 @@
+
+base_model: openlm-research/open_llama_3b_v2
+model_type: LlamaForCausalLM
+tokenizer_type: LlamaTokenizer
+load_in_8bit: false
+load_in_4bit: true
+strict: false
+push_dataset_to_hub: false
+datasets:
+- path: dataset.json
+  ds_type: json
+  type: completion
+dataset_prepared_path:
+val_set_size: 0.05
+adapter: qlora
+lora_model_dir:
+sequence_len: 1024
+sample_packing: true
+lora_r: 8
+lora_alpha: 32
+lora_dropout: 0.05
+lora_target_modules:
+lora_target_linear: true
+lora_fan_in_fan_out:
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_run_id:
+wandb_log_model:
+output_dir: ./qlora-out
+gradient_accumulation_steps: 1
+micro_batch_size: 2
+num_epochs: 4
+optimizer: paged_adamw_32bit
+torchdistx_path:
+lr_scheduler: cosine
+learning_rate: 0.0002
+train_on_inputs: false
+group_by_length: false
+bf16: false
+fp16: true
+tf32: false
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: false
+gptq_groupsize:
+gptq_model_v1:
+warmup_steps: 20
+eval_steps: 0.05
+save_steps:
+debug:
+deepspeed:
+weight_decay: 0.1
+fsdp:
+fsdp_config:
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
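Two quantities implied by this config may help when adapting it: the effective batch size per optimizer step and the LoRA scaling factor. The sketch below is not part of the commit. It assumes a single GPU and the standard LoRA scaling of `lora_alpha / lora_r` (no rank-stabilized variant), and the function names are hypothetical.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Sequences contributing to one optimizer step across all devices."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update under standard LoRA scaling."""
    return lora_alpha / lora_r


# Values taken from axolotl.yaml above.
print(effective_batch_size(micro_batch_size=2, gradient_accumulation_steps=1))  # 2
print(lora_scaling(lora_alpha=32, lora_r=8))  # 4.0
```

If memory is tight, lowering `micro_batch_size` while raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.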
--- a/examples/e2e-fine-tuning/notebook.ipynb
+++ b/examples/e2e-fine-tuning/notebook.ipynb