## Fine-tuning a model and using it with LocalAI

This is an example of fine-tuning an LLM to use with [LocalAI](https://github.com/mudler/LocalAI), written by [@mudler](https://github.com/mudler).

Specifically, this example shows how to use [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to fine-tune an LLM and consume it with LocalAI as a `gguf` model.

# Important!

Before starting, make sure you have selected a GPU runtime: Runtime -> Change runtime type -> GPU (T4).

Point the model configuration at your dataset: upload the dataset as `output.jsonl` to the root of the file tree (`/content`), then edit the model file (`model.yml`) with:

```yaml
# local
datasets:
  - path: /content/output.jsonl
    ds_type: json
    type: completion

```
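A dataset of this kind is a JSON Lines file: one JSON object per line. The sketch below assumes each record carries its training text in a `text` field, which is the field axolotl's `completion` type reads by default in the versions I am aware of; check the axolotl documentation for the version you install. The `write_completion_dataset` helper and the sample strings are hypothetical.

```python
import json
from pathlib import Path


def write_completion_dataset(texts, path="output.jsonl"):
    """Write one {"text": ...} JSON object per line (JSON Lines)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical placeholder records; replace them with your own corpus.
    samples = ["First training document.", "Second training document."]
    out = write_completion_dataset(samples, "output.jsonl")
    print(f"wrote {len(samples)} records to {out}")
```

Upload the resulting `output.jsonl` to `/content` so that it matches the `path` in the configuration.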

A full example:

```yaml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false
push_dataset_to_hub: false
datasets:
  - path: /content/output.jsonl
    ds_type: json
    type: completion
dataset_prepared_path:
val_set_size: 0.05
adapter: qlora
lora_model_dir:
sequence_len: 1024
sample_packing: true
lora_r: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: ./qlora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false
gptq_groupsize:
gptq_model_v1:
warmup_steps: 20
eval_steps: 0.05
save_steps:
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

```
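Two numbers that follow from this configuration are worth checking before training: the LoRA scaling factor (`lora_alpha / lora_r`) and the effective batch size per optimizer step (`micro_batch_size * gradient_accumulation_steps * number of GPUs`). A minimal sketch, with the values copied by hand from the YAML above and a single GPU assumed, as on a Colab T4 runtime. `derived_settings` is a hypothetical helper, not part of axolotl.

```python
# Values copied by hand from model.yml above.
config = {
    "lora_r": 8,
    "lora_alpha": 32,
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "sequence_len": 1024,
}


def derived_settings(cfg, num_gpus=1):
    """Compute quantities implied by the config (num_gpus is an assumption)."""
    effective_batch = (
        cfg["micro_batch_size"] * cfg["gradient_accumulation_steps"] * num_gpus
    )
    return {
        # Standard LoRA scales the low-rank update by alpha / r.
        "lora_scaling": cfg["lora_alpha"] / cfg["lora_r"],
        # Number of sequences contributing to each optimizer step.
        "effective_batch_size": effective_batch,
        # Upper bound on tokens per step, reached only if every sequence is full.
        "max_tokens_per_step": effective_batch * cfg["sequence_len"],
    }


print(derived_settings(config))
# → {'lora_scaling': 4.0, 'effective_batch_size': 2, 'max_tokens_per_step': 2048}
```

If training runs out of memory on the T4, lowering `micro_batch_size` and raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.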

In [1]:
# Install axolotl, pinned to a known commit (0.3.0)
!git clone https://github.com/OpenAccess-AI-Collective/axolotl && cd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a # 0.3.0
# Each `!` line runs in its own subshell, so `cd` has to be chained with the command it applies to
!pip install packaging
!cd axolotl && pip install -e '.[flash-attn,deepspeed]'

Cloning into 'axolotl'...
remote: Enumerating objects: 7525, done.
remote: Counting objects: 100% (1726/1726), done.
remote: Compressing objects: 100% (385/385), done.
remote: Total 7525 (delta 1525), reused 1409 (delta 1319), pack-reused 5799
Receiving objects: 100% (7525/7525), 2.64 MiB | 10.52 MiB/s, done.
Resolving deltas: 100% (4854/4854), done.
Note: switching to '797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 797f3dd don't train if eval split is too

In [2]:
!accelerate config default

accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml


In [3]:
!pip install accelerate bitsandbytes
!pwd

/content


In [4]:
import torch
torch.cuda.is_available()

True
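Beyond confirming that CUDA is available, it can help to see which GPU was assigned, since the configuration above sets `fp16: true` and `bf16: false`. Native bf16 needs compute capability 8.0 (Ampere) or newer, which the T4 (7.5) does not have. The sketch below reports the device name and compute capability; `describe_gpu` is a hypothetical helper.

```python
def describe_gpu():
    """Return basic facts about the first CUDA device, or a stub if none is usable."""
    try:
        import torch
    except ImportError:
        return {"available": False, "reason": "torch is not installed"}
    if not torch.cuda.is_available():
        return {"available": False, "reason": "no CUDA device visible"}
    major, minor = torch.cuda.get_device_capability(0)
    return {
        "available": True,
        "name": torch.cuda.get_device_name(0),
        "compute_capability": f"{major}.{minor}",
        # Native bf16 requires Ampere (8.0) or newer; older GPUs should stay on fp16.
        "native_bf16": major >= 8,
    }


print(describe_gpu())
```

If `native_bf16` comes back `True` on a different runtime, switching the config to `bf16: true` and `fp16: false` may be worth trying, but that is outside what this example was run with.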

In [5]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.9/7.9 MB[0m [31m24.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.35.1
    Uninstalling transformers-4.35.1:
      Successfully uninstalled transformers-4.35.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
axolotl 0.3.0 requires transformers==4.35.1, but you have transformers 4.35.2 which is incompatible.
Successfully installed transformers-4.35.2
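
The resolver error above means the upgrade replaced the `transformers` version that axolotl 0.3.0 pins (4.35.1). Whether that matters depends on your run. To see what is actually installed, a small check with the standard library's `importlib.metadata` is enough. The pin passed in below is copied by hand from the pip message, and `check_pins` is a hypothetical helper.

```python
from importlib import metadata


def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        report[package] = (installed, installed == wanted)
    return report


# Pin taken from the pip error message above.
for name, (installed, ok) in check_pins({"transformers": "4.35.1"}).items():
    status = "ok" if ok else "differs from pin"
    print(f"{name}: installed={installed} ({status})")
```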


In [6]:
# https://github.com/oobabooga/text-generation-webui/issues/4238
!pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Collecting flash-attn==2.3.0+cu117torch2.0cxx11abiFALSE
  Downloading https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl (30.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.0/30.0 MB[0m [31m46.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: flash-attn
  Attempting uninstall: flash-attn
    Found existing installation: flash-attn 2.3.3
    Uninstalling flash-attn-2.3.3:
      Successfully uninstalled flash-attn-2.3.3
Successfully installed flash-attn-2.3.0


Start the training process (fine-tuning)

In [None]:
!accelerate launch -m axolotl.cli.train model.yml --load_in_8bit=False

2023-11-18 10:15:30.581758: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-18 10:15:30.581829: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-18 10:15:30.581870: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                          

In [None]:
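# Merge the trained LoRA adapter from ./qlora-out into the base model.
# The merged weights are written to ./qlora-out/merged.
!python3 -m axolotl.cli.merge_lora model.yml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False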

In [None]:

# Build llama.cpp with CUDA support; it provides the conversion and quantization tools used below.
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && make GGML_CUDA=1



In [None]:

# Convert the merged PyTorch model to GGUF so that it can be quantized.
# This creates 'ggml-model-f16.gguf' in the 'merged' directory.
!cd llama.cpp && python convert.py --outtype f16 \
    /content/qlora-out/merged/pytorch_model-00001-of-00002.bin
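
Before quantizing, you may want to confirm that the conversion really produced a GGUF file. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check only needs to read the file header. A minimal sketch; `is_gguf` is a hypothetical helper, and the path is the one the quantization step in this notebook reads from.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # The first four bytes of a GGUF file.


def is_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


print(is_gguf("/content/qlora-out/merged/ggml-model-f16.gguf"))
```

The same check works on the quantized output, since quantization keeps the GGUF container.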


In [None]:

# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
!cd llama.cpp && ./quantize /content/qlora-out/merged/ggml-model-f16.gguf \
    /content/custom-model-q4_0.bin q4_0
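
Once the quantized file has been downloaded from `/content` and placed in the models directory of a LocalAI instance, it can be queried through LocalAI's OpenAI-compatible HTTP API. The sketch below has not been run against this particular model and makes several assumptions: LocalAI is listening on `http://localhost:8080`, the model is addressable by its file name, and the response follows the OpenAI completion schema. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="custom-model-q4_0.bin",
                             base_url="http://localhost:8080"):
    """Build (but do not send) an OpenAI-style completion request."""
    payload = {"model": model, "prompt": prompt}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage (requires a running LocalAI instance that has the model available):
# print(complete("Once upon a time"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a real server.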