docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Ettore Di Giacinto 2024-01-18 19:41:08 +01:00 committed by GitHub
parent b5c93f176a
commit 6ca4d38a01
79 changed files with 1826 additions and 3546 deletions

@@ -0,0 +1,11 @@
---
weight: 20
title: "Advanced"
description: "Advanced usage"
icon: science
lead: ""
date: 2020-10-06T08:49:15+00:00
lastmod: 2020-10-06T08:49:15+00:00
draft: false
images: []
---

@@ -0,0 +1,448 @@
+++
disableToc = false
title = "Advanced usage"
weight = 21
url = '/advanced'
+++
### Advanced configuration with YAML files
In order to define default prompts, model parameters (such as custom default `top_p` or `top_k`), LocalAI can be configured to serve user-defined models with a set of default parameters and templates.
To configure a model, you can create multiple YAML files in the models path, or specify a single YAML configuration file.
Consider the following `models` folder in the `example/chatbot-ui`:
```
base ls -liah examples/chatbot-ui/models
36487587 drwxr-xr-x 2 mudler mudler 4.0K May 3 12:27 .
36487586 drwxr-xr-x 3 mudler mudler 4.0K May 3 10:42 ..
36465214 -rw-r--r-- 1 mudler mudler 10 Apr 27 07:46 completion.tmpl
36464855 -rw-r--r-- 1 mudler mudler ?G Apr 27 00:08 luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
36464537 -rw-r--r-- 1 mudler mudler 245 May 3 10:42 gpt-3.5-turbo.yaml
36467388 -rw-r--r-- 1 mudler mudler 180 Apr 27 07:46 chat.tmpl
```
The `gpt-3.5-turbo.yaml` file defines the `gpt-3.5-turbo` model, which is an alias to use `luna-ai-llama2` with pre-defined options.
For instance, consider the following that declares `gpt-3.5-turbo` backed by the `luna-ai-llama2` model:
```yaml
name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..
# Default context size
context_size: 512
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv
# Enable prompt caching
prompt_cache_path: "alpaca-cache"
prompt_cache_all: true
# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# define chat roles
roles:
  assistant: '### Response:'
  system: '### System Instruction:'
  user: '### Instruction:'
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: chat
```
Specifying a `config-file` via the CLI allows you to declare models in a single file as a list, for instance:
```yaml
- name: list1
  parameters:
    model: testmodel
  context_size: 512
  threads: 10
  stopwords:
  - "HUMAN:"
  - "### Response:"
  roles:
    user: "HUMAN:"
    system: "GPT:"
  template:
    completion: completion
    chat: chat
- name: list2
  parameters:
    model: testmodel
  context_size: 512
  threads: 10
  stopwords:
  - "HUMAN:"
  - "### Response:"
  roles:
    user: "HUMAN:"
    system: "GPT:"
  template:
    completion: completion
    chat: chat
```
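For example, assuming the list above is saved as `models.yaml` (an illustrative path), LocalAI can be pointed at it with the `--config-file` flag:
```bash
# Load all model definitions from a single YAML configuration file
local-ai --config-file ./models.yaml --models-path ./models
```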
See also [chatbot-ui](https://github.com/go-skynet/LocalAI/tree/master/examples/chatbot-ui) as an example on how to use config files.
You can specify a full URL or a short-hand URL to a YAML model configuration file and load it at start with local-ai, for example to use phi-2:
```
local-ai github://mudler/LocalAI/examples/configurations/phi-2.yaml@master
```
### Full config model file reference
```yaml
# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo
# Default model parameters.
# These options can also be specified in the API calls
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..
  top_k:
  top_p:
  max_tokens:
  ignore_eos: true
  n_keep: 10
  seed:
  mode:
  step:
  negative_prompt:
  typical_p:
  tfz:
  frequency_penalty:
  mirostat_eta:
  mirostat_tau:
  mirostat:
  rope_freq_base:
  rope_freq_scale:
  negative_prompt_scale:
# Default context size
context_size: 512
# Default number of threads
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv
# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# string to trim space to
trimspace:
- string
# Strings to cut from the response
cutstrings:
- "string"
# Directory used to store additional assets
asset_dir: ""
# define chat roles
roles:
  user: "HUMAN:"
  system: "GPT:"
  assistant: "ASSISTANT:"
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: chat
  edit: edit_template
  function: function_template
function:
  disable_no_action: true
  no_action_function_name: "reply"
  no_action_description_name: "Reply to the AI assistant"
system_prompt:
rms_norm_eps:
# Set it to 8 for llama2 70b
ngqa: 1
## LLAMA specific options
# Enable F16 if backend supports it
f16: true
# Enable debugging
debug: true
# Enable embeddings
embeddings: true
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: 22
# Enable memory lock
mmlock: true
# GPU setting to split the tensor in multiple parts and define a main GPU
# see llama.cpp for usage
tensor_split: ""
main_gpu: ""
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Enable mmap
mmap: true
# Enable low vram mode (GPU only)
low_vram: true
# Set NUMA mode (CPU only)
numa: true
# Lora settings
lora_adapter: "/path/to/lora/adapter"
lora_base: "/path/to/lora/base"
# Disable mulmatq (CUDA)
no_mulmatq: true
# Diffusers/transformers
cuda: true
```
### Prompt templates
The API doesn't inject a default prompt for talking to the model. You have to use a prompt similar to what's described in the Stanford Alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.
<details>
You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
```
The below instruction describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
```
See the [prompt-templates](https://github.com/go-skynet/LocalAI/tree/master/prompt-templates) directory in this repository for templates for some of the most popular models.
For the edit endpoint, an example template for alpaca-based models can be:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{{.Instruction}}
### Input:
{{.Input}}
### Response:
```
</details>
### Install models using the API
Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.
A curated collection of model files is in the [model-gallery](https://github.com/go-skynet/model-gallery) (work in progress!). The files of the model gallery are different from the model files used to configure LocalAI models. The model gallery files contain information about the model setup, and the files necessary to run the model locally.
For example, to install `lunademo`, you can send a POST call to the `/models/apply` endpoint with the model definition (an `id` or `url`) and, optionally, the name the model should have in LocalAI (`name`):
```bash
curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
"id": "TheBloke/Luna-AI-Llama2-Uncensored-GGML/luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin",
"name": "lunademo"
}'
```
### Preloading models during startup
In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.
```bash
PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]' local-ai
```
`PRELOAD_MODELS` (or `--preload-models`) takes a JSON list with the same parameters as the API calls to the `/models/apply` endpoint.
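The same list can also be passed as a CLI flag instead of an environment variable, for example (using the same gallery file as above):
```bash
# Preload gpt4all-j at startup using the CLI flag
local-ai --preload-models '[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml", "name": "gpt4all-j"}]'
```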
Similarly, a path to a YAML configuration file containing a list of models can be specified with `PRELOAD_MODELS_CONFIG` (or `--preload-models-config`):
```yaml
- url: https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml
  name: gpt4all-j
# ...
```
### Automatic prompt caching
LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text in the prompt before the input.
To enable prompt caching, you can control the settings in the model config YAML file:
```yaml
# Enable prompt caching
prompt_cache_path: "cache"
prompt_cache_all: true
```
`prompt_cache_path` is relative to the models folder. You can enter a name for the file, which will be automatically created during the first load if `prompt_cache_all` is set to `true`.
### Configuring a specific backend for the model
By default LocalAI will try to autoload the model by trying all the backends. This works for most models, but some backends are NOT configured to autoload.
The available backends are listed in the [model compatibility table]({{%relref "docs/reference/compatibility-table" %}}).
In order to specify a backend for your models, create a model config file in your `models` directory specifying the backend:
```yaml
name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: ...
backend: llama-stable
# ...
```
### Connect external backends
LocalAI backends are internally implemented using `gRPC` services. This also allows `LocalAI` to connect to external `gRPC` services on start and extend LocalAI functionalities via third-party binaries.
The `--external-grpc-backends` parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is `<BACKEND_NAME>:<BACKEND_URI>`. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.
So for instance, to register a new backend which is a local file:
```
./local-ai --debug --external-grpc-backends "my-awesome-backend:/path/to/my/backend.py"
```
Or a remote URI:
```
./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"
```
For example, to start the vllm backend manually after compiling LocalAI (assuming the command is run from the root of the repository):
```bash
./local-ai --external-grpc-backends "vllm:$PWD/backend/python/vllm/run.sh"
```
Note that it is first necessary to create the conda environment with:
```bash
make -C backend/python/vllm
```
### Environment variables
When LocalAI runs in a container,
there are additional environment variables available that modify the behavior of LocalAI on startup:
| Environment variable | Default | Description |
|----------------------------|---------|------------------------------------------------------------------------------------------------------------|
| `REBUILD` | `false` | Rebuild LocalAI on startup |
| `BUILD_TYPE` | | Build type. Available: `cublas`, `openblas`, `clblas` |
| `GO_TAGS` | | Go tags. Available: `stablediffusion` |
| `HUGGINGFACEHUB_API_TOKEN` | | Special token for interacting with HuggingFace Inference API, required only when using the `langchain-huggingface` backend |
| `EXTRA_BACKENDS` | | A space separated list of backends to prepare. For example `EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers"` prepares the conda environment on start |
Here is how to configure these variables:
```bash
# Option 1: command line
docker run --env REBUILD=true localai
# Option 2: set within an env file
docker run --env-file .env localai
```
### CLI parameters
You can control LocalAI with command line arguments, to specify a binding address, or the number of threads.
| Parameter | Environmental Variable | Default Variable | Description |
| ------------------------------ | ------------------------------- | -------------------------------------------------- | ------------------------------------------------------------------- |
| --f16 | $F16 | false | Enable f16 mode |
| --debug | $DEBUG | false | Enable debug mode |
| --cors | $CORS | false | Enable CORS support |
| --cors-allow-origins value | $CORS_ALLOW_ORIGINS | | Specify origins allowed for CORS |
| --threads value | $THREADS | 4 | Number of threads to use for parallel computation |
| --models-path value | $MODELS_PATH | ./models | Path to the directory containing models used for inferencing |
| --preload-models value | $PRELOAD_MODELS | | List of models to preload in JSON format at startup |
| --preload-models-config value | $PRELOAD_MODELS_CONFIG | | A config with a list of models to apply at startup. Specify the path to a YAML config file |
| --config-file value | $CONFIG_FILE | | Path to the config file |
| --address value | $ADDRESS | :8080 | Specify the bind address for the API server |
| --image-path value | $IMAGE_PATH | | Path to the directory used to store generated images |
| --context-size value | $CONTEXT_SIZE | 512 | Default context size of the model |
| --upload-limit value | $UPLOAD_LIMIT | 15 | Default upload limit in megabytes (audio file upload) |
| --galleries | $GALLERIES | | Allows setting galleries from the command line |
| --parallel-requests | $PARALLEL_REQUESTS | false | Enable backends to handle multiple requests in parallel. This is for backends that support multiple requests in parallel, like llama.cpp or vllm |
| --single-active-backend | $SINGLE_ACTIVE_BACKEND | false | Allow only one backend to be running |
| --api-keys value | $API_KEY | empty | List of API Keys to enable API authentication. When this is set, all the requests must be authenticated with one of these API keys. |
| --enable-watchdog-idle | $WATCHDOG_IDLE | false | Enable watchdog for stopping idle backends. This will stop the backends if they are in an idle state for too long. |
| --enable-watchdog-busy | $WATCHDOG_BUSY | false | Enable watchdog for stopping busy backends that exceed a defined threshold. |
| --watchdog-busy-timeout value | $WATCHDOG_BUSY_TIMEOUT | 5m | Threshold of busy time after which the watchdog stops the backend |
| --watchdog-idle-timeout value | $WATCHDOG_IDLE_TIMEOUT | 15m | Threshold of idle time after which the watchdog stops the backend |
| --preload-backend-only | $PRELOAD_BACKEND_ONLY | false | If set, the API is NOT launched, and only the preloaded models / backends are started. This is intended for multi-node setups. |
| --external-grpc-backends | $EXTERNAL_GRPC_BACKENDS | none | Comma separated list of external gRPC backends to use. Format: `name:host:port` or `name:/path/to/file` |
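As a quick illustration, here is a possible startup command combining a few of the flags above (the values are just examples, adjust them to your setup):
```bash
# Bind on all interfaces, use a custom models directory and a larger context size
./local-ai --address ":8080" --models-path ./models --context-size 1024 --threads 8 --debug
```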
### Extra backends
LocalAI can be extended with extra backends. The backends are implemented as `gRPC` services and can be written in any language. The container images that are built and published on [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags) are split into core and extra images. By default, images bring all the dependencies and backends supported by LocalAI (we call those `extra` images). The `-core` images instead bring only the strictly necessary dependencies to run LocalAI, with only a core set of backends.
If you wish to build a custom container image with extra backends, you can use the core images and build only the backends you are interested in, or prepare the environment on startup by using the `EXTRA_BACKENDS` environment variable. For instance, to use the diffusers backend:
```Dockerfile
FROM quay.io/go-skynet/local-ai:master-ffmpeg-core
RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers
```
Remember also to set the `EXTERNAL_GRPC_BACKENDS` environment variable (or `--external-grpc-backends` as CLI flag) to point to the backends you are using (`EXTERNAL_GRPC_BACKENDS="backend_name:/path/to/backend"`), for example with diffusers:
```Dockerfile
FROM quay.io/go-skynet/local-ai:master-ffmpeg-core
RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers
ENV EXTERNAL_GRPC_BACKENDS="diffusers:/build/backend/python/diffusers/run.sh"
```
{{% alert note %}}
You can specify remote external backends or path to local files. The syntax is `backend-name:/path/to/backend` or `backend-name:host:port`.
{{% /alert %}}
#### In runtime
When using the `-core` container image it is possible to prepare the Python backends you are interested in by using the `EXTRA_BACKENDS` variable, for instance:
```bash
docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master-ffmpeg-core
```

@@ -0,0 +1,136 @@
+++
disableToc = false
title = "Fine-tuning LLMs for text generation"
weight = 22
+++
{{% alert note %}}
Section under construction
{{% /alert %}}
This section covers how to fine-tune a language model for text generation and consume it in LocalAI.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mudler/LocalAI/blob/master/examples/e2e-fine-tuning/notebook.ipynb)
## Requirements
For this example you will need a GPU with at least 12GB of VRAM and a Linux box.
## Fine-tuning
Fine-tuning a language model is a process that requires a lot of computational power and time.
Currently LocalAI doesn't support the fine-tuning endpoint, but there are [plans](https://github.com/mudler/LocalAI/issues/596) to support that. For the time being, a guide is proposed here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).
There is an e2e example of fine-tuning an LLM to use with [LocalAI](https://github.com/mudler/LocalAI) written by [@mudler](https://github.com/mudler) available [here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/).
The steps involved are:
- Preparing a dataset
- Prepare the environment and install dependencies
- Fine-tune the model
- Merge the Lora base with the model
- Convert the model to gguf
- Use the model with LocalAI
## Dataset preparation
We are going to need a dataset or a set of datasets.
Axolotl supports a variety of formats. In the notebook and in this example we are aiming for a very simple dataset that we build manually, so we are going to use the `completion` format, which requires the full text to be used for fine-tuning.
A dataset for an instruction-following model (like Alpaca) can look like the following:
```json
[
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  },
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  }
]
```
Every block in the text is the whole text that is used to fine-tune. For example, for an instruction-following model it follows this format (more or less):
```
<System prompt>
## Instruction
<Question, instruction>
## Response
<Expected response from the LLM>
```
The instruction format works like this: at inference time we feed the model only the first part, up to the `## Instruction` block, and the model completes the text with the `## Response` block.
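For illustration, assuming the fine-tuned model is later served by LocalAI under the hypothetical name `custom-model`, an inference request would send only the prefix and let the model fill in the response:
```bash
# Send only the text up to the "## Response" marker; the model completes it
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "custom-model",
  "prompt": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\n"
}'
```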
Prepare a dataset, and upload it to your Google Drive in case you are using the Google Colab. Otherwise place it next to the `axolotl.yaml` file as `dataset.json`.
### Install dependencies
```bash
# Install axolotl and dependencies
git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd
# https://github.com/oobabooga/text-generation-webui/issues/4238
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
Configure accelerate:
```bash
accelerate config default
```
## Fine-tuning
We will need to configure axolotl. An `axolotl.yaml` file that uses openllama-3b for fine-tuning is provided in this example. Copy the `axolotl.yaml` file and edit it to your needs. The dataset needs to be next to it as `dataset.json`. You can find the `axolotl.yaml` file [here](https://github.com/mudler/LocalAI/tree/master/examples/e2e-fine-tuning/).
If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:
```bash
# Optional pre-tokenize (run only if big dataset)
python -m axolotl.cli.preprocess axolotl.yaml
```
Now we are ready to start the fine-tuning process:
```bash
# Fine-tune
accelerate launch -m axolotl.cli.train axolotl.yaml
```
After we have finished the fine-tuning, we merge the Lora base with the model:
```bash
# Merge lora
python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False
```
And we convert it to the gguf format that LocalAI can consume:
```bash
# Convert to gguf
git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && make LLAMA_CUBLAS=1 && popd
# We need to convert the pytorch model into ggml for quantization
# It creates 'ggml-model-f16.gguf' in the 'merged' directory.
pushd llama.cpp && python convert.py --outtype f16 \
../qlora-out/merged/pytorch_model-00001-of-00002.bin && popd
# Start off by making a basic q4_0 4-bit quantization.
# It's important to have 'ggml' in the name of the quant for some
# software to recognize its file format.
pushd llama.cpp && ./quantize ../qlora-out/merged/ggml-model-f16.gguf \
../custom-model-q4_0.bin q4_0
```
Now you should have ended up with a `custom-model-q4_0.bin` file that you can copy into the LocalAI models directory and use with LocalAI.
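As a final check, here is a minimal sketch of wiring the quantized file into LocalAI (the model name and settings below are illustrative, adjust them to your setup):
```bash
# Copy the quantized model into the LocalAI models directory
cp custom-model-q4_0.bin models/

# Create a minimal model definition next to it
cat > models/custom-model.yaml <<EOF
name: custom-model
backend: llama
parameters:
  model: custom-model-q4_0.bin
context_size: 1024
EOF

# Query the model through the OpenAI-compatible completions endpoint
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "custom-model",
  "prompt": "Write a poem about a tree."
}'
```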

docs/content/docs/faq.md
@@ -0,0 +1,57 @@
+++
disableToc = false
title = "FAQ"
weight = 24
icon = "quiz"
+++
## Frequently asked questions
Here are answers to some of the most common questions.
### How do I get models?
Most gguf-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open up issues. However, be cautious about downloading models from the internet and directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, and models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
### What's the difference with Serge, or XXX?
LocalAI is a multi-model solution that doesn't focus on a specific model type (e.g., llama.cpp or alpaca.cpp), and it handles all of these internally for faster inference, making it easy to set up locally and deploy to Kubernetes.
### Everything is slow, how is it possible?
There are a few situations why this could occur. Some tips are:
- Don't use an HDD to store your models. Prefer SSD over HDD. If you are stuck with an HDD, disable `mmap` in the model config file so it loads everything into memory.
- Watch out for CPU overbooking. Ideally the `--threads` should match the number of physical cores. For instance, if your CPU has 4 cores, you would ideally allocate `<= 4` threads to a model.
- Run LocalAI with `DEBUG=true`. This gives more information, including stats on the token inference speed.
- Check that you are actually getting an output: run a simple curl request with `"stream": true` to see how fast the model is responding (see the example below).
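A minimal streaming check, assuming a model named `gpt-3.5-turbo` is configured:
```bash
# Tokens should arrive incrementally; long pauses point at a loading or CPU issue
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "How are you?"}],
  "stream": true
}'
```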
### Can I use it with a Discord bot, or XXX?
Yes! If the client uses OpenAI and supports setting a different base URL to send requests to, you can use the LocalAI endpoint. This allows you to use LocalAI with every application that was supposed to work with OpenAI, but without changing the application!
### Can this leverage GPUs?
There is GPU support, see the [GPU acceleration]({{%relref "docs/features/GPU-acceleration" %}}) page.
### Where is the webUI?
localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, as LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several already on GitHub that should be compatible with LocalAI (as it mimics the OpenAI API).
### Does it work with AutoGPT?
Yes, see the [examples](https://github.com/go-skynet/LocalAI/tree/master/examples/)!
### How can I troubleshoot when something is wrong?
Enable the debug mode by setting `DEBUG=true` in the environment variables. This will give you more information on what's going on.
You can also specify `--debug` in the command line.
### I'm getting 'invalid pitch' error when running with CUDA, what's wrong?
This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.
### I'm getting a 'SIGILL' error, what's wrong?
Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting `REBUILD=true` and disable the CPU instructions that are not compatible with your CPU. For instance: `CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" make build`
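For instance, when running the container image, a rebuild with those instruction sets disabled might look like the following sketch (the image tag is an example, and it is assumed that the container forwards `CMAKE_ARGS` to the rebuild):
```bash
# Rebuild LocalAI inside the container with incompatible CPU instructions disabled
docker run -p 8080:8080 -v $PWD/models:/models \
  -e REBUILD=true \
  -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" \
  quay.io/go-skynet/local-ai:master
```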

@@ -0,0 +1,109 @@
+++
disableToc = false
title = "⚡ GPU acceleration"
weight = 9
+++
{{% alert context="warning" %}}
Section under construction
{{% /alert %}}
This section contains instructions on how to use LocalAI with GPU acceleration.
{{% alert icon="⚡" context="warning" %}}
For acceleration on AMD or Metal hardware there are no specific container images; see the [build]({{%relref "docs/getting-started/build#Acceleration" %}}) section.
{{% /alert %}}
### CUDA(NVIDIA) acceleration
#### Requirements
Requirement: nvidia-container-toolkit (installation instructions [1](https://www.server-world.info/en/note?os=Ubuntu_22.04&p=nvidia&f=2) [2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))
To check which CUDA version you need, you can run either `nvidia-smi` or `nvcc --version`.
Alternatively, you can also check nvidia-smi with docker:
```
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```
To use CUDA, use the images with the `cublas` tag.
The image list is on [quay](https://quay.io/repository/go-skynet/local-ai?tab=tags):
- CUDA `11` tags: `master-cublas-cuda11`, `v1.40.0-cublas-cuda11`, ...
- CUDA `12` tags: `master-cublas-cuda12`, `v1.40.0-cublas-cuda12`, ...
- CUDA `11` + FFmpeg tags: `master-cublas-cuda11-ffmpeg`, `v1.40.0-cublas-cuda11-ffmpeg`, ...
- CUDA `12` + FFmpeg tags: `master-cublas-cuda12-ffmpeg`, `v1.40.0-cublas-cuda12-ffmpeg`, ...
In addition to the commands to run LocalAI normally, you need to specify `--gpus all` to docker, for example:
```bash
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda12
```
If the GPU inferencing is working, you should be able to see something like:
```
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
```
#### Model configuration
Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for `llama.cpp` workloads a configuration file might look like this (where `gpu_layers` is the number of layers to offload to the GPU):
```yaml
name: my-model-name
# Default model parameters
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin
context_size: 1024
threads: 1
f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)
```
For diffusers instead, it might look like this:
```yaml
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
```

@@ -0,0 +1,7 @@
+++
disableToc = false
title = "Features"
weight = 8
icon = "feature_search"
+++

@@ -0,0 +1,43 @@
+++
disableToc = false
title = "🔈 Audio to text"
weight = 16
+++
Audio to text models are models that can generate text from an audio file.
The transcription endpoint allows you to convert audio files to text. The endpoint is based on [whisper.cpp](https://github.com/ggerganov/whisper.cpp), a C++ library for audio transcription. The endpoint input supports all the audio formats supported by `ffmpeg`.
## Usage
Once LocalAI is started and whisper models are installed, you can use the `/v1/audio/transcriptions` API endpoint.
For instance, with cURL:
```bash
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"
```
## Example
Download one of the models from [here](https://huggingface.co/ggerganov/whisper.cpp/tree/main) in the `models` folder, and create a YAML file for your model:
```yaml
name: whisper-1
backend: whisper
parameters:
  model: whisper-en
```
The transcriptions endpoint then can be tested like so:
```bash
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"
## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}
```

@@ -0,0 +1,30 @@
+++
disableToc = false
title = "✍️ Constrained grammars"
weight = 15
+++
The chat endpoint accepts an additional `grammar` parameter which takes a [BNF defined grammar](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form).
This allows the LLM to constrain the output to a user-defined schema, allowing you to generate `JSON`, `YAML`, and everything else that can be defined with a BNF grammar.
{{% alert note %}}
This feature works only with models compatible with the [llama.cpp](https://github.com/ggerganov/llama.cpp) backend (see also [Model compatibility]({{%relref "docs/reference/compatibility-table" %}})). For details on how it works, see the upstream PRs: https://github.com/ggerganov/llama.cpp/pull/1773, https://github.com/ggerganov/llama.cpp/pull/1887
{{% /alert %}}
## Setup
Follow the setup instructions from the [LocalAI functions]({{%relref "docs/features/openai-functions" %}}) page.
## 💡 Usage example
For example, to constrain the output to either `yes` or `no`:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Do you like apples?"}],
"grammar": "root ::= (\"yes\" | \"no\")"
}'
```

@@ -0,0 +1,102 @@
+++
disableToc = false
title = "🧠 Embeddings"
weight = 13
+++
LocalAI supports generating embeddings for text or lists of tokens.
For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings
## Model compatibility
The embedding endpoint is compatible with `llama.cpp` models, `bert.cpp` models and sentence-transformers models available on Hugging Face.
## Manual Setup
Create a `YAML` config file in the `models` directory. Specify the `backend` and the model file.
```yaml
name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true
# .. other parameters
```
## Bert embeddings
To use `bert.cpp` models you can use the `bert` embedding backend.
An example model config file:
```yaml
name: text-embedding-ada-002
parameters:
  model: bert
backend: bert-embeddings
embeddings: true
# .. other parameters
```
The `bert` backend uses [bert.cpp](https://github.com/skeskinen/bert.cpp) and uses `ggml` models.
For instance you can download the `ggml` quantized version of `all-MiniLM-L6-v2` from https://huggingface.co/skeskinen/ggml:
```bash
wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert
```
To test locally (LocalAI server running on `localhost`),
you can use `curl` (and `jq` at the end to prettify):
```bash
curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}' | jq "."
```
## Huggingface embeddings
To use `sentence-transformers` and models in `huggingface` you can use the `sentencetransformers` embedding backend.
```yaml
name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2
```
The `sentencetransformers` backend uses Python [sentence-transformers](https://github.com/UKPLab/sentence-transformers). For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models
{{% alert note %}}
- The `sentencetransformers` backend is an optional backend of LocalAI and uses Python. If you are running `LocalAI` from the container images, you are good to go: they are already configured for use.
- If you are running `LocalAI` manually you must install the python dependencies (`make prepare-extra-conda-environments`). This requires `conda` to be installed.
- For local execution, you also have to specify the extra backend in the `EXTERNAL_GRPC_BACKENDS` environment variable.
- Example: `EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"`
- The `sentencetransformers` backend supports only embeddings of text, not of tokens. If you need to embed tokens you can use the `bert` backend or `llama.cpp`.
- No models are required to be downloaded before using the `sentencetransformers` backend. The models will be downloaded automatically the first time the API is used.
{{% /alert %}}
## Llama.cpp embeddings
Embeddings with `llama.cpp` are supported with the `llama` backend.
```yaml
name: my-awesome-model
backend: llama
embeddings: true
parameters:
  model: ggml-file.bin
  # ...
```
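Once the model is configured, the request looks the same as for the other backends, for instance (using the `my-awesome-model` name from the config above):
```bash
curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "A test sentence to embed",
  "model": "my-awesome-model"
}' | jq "."
```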
## 💡 Examples
- Example that uses LlamaIndex and LocalAI for embeddings: [here](https://github.com/go-skynet/LocalAI/tree/master/examples/query_data/).

@@ -0,0 +1,30 @@
+++
disableToc = false
title = "🆕 GPT Vision"
weight = 14
+++
{{% alert note %}}
Available only on `master` builds
{{% /alert %}}
LocalAI supports understanding images by using [LLaVA](https://llava.hliu.cc/), and implements the [GPT Vision API](https://platform.openai.com/docs/guides/vision) from OpenAI.
![llava](https://github.com/mudler/LocalAI/assets/2420543/cb0a0897-3b58-4350-af66-e6f4387b58d3)
## Usage
OpenAI docs: https://platform.openai.com/docs/guides/vision
To let LocalAI understand and reply with what it sees in the image, use the `/v1/chat/completions` endpoint, for example with curl:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava",
"messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
```
### Setup
To setup the LLaVa models, follow the full example in the [configuration examples](https://github.com/mudler/LocalAI/blob/master/examples/configurations/README.md#llava).

@@ -0,0 +1,352 @@
+++
disableToc = false
title = "🎨 Image generation"
weight = 12
+++
![anime_girl](https://github.com/go-skynet/LocalAI/assets/2420543/8aaca62a-e864-4011-98ae-dcc708103928)
(Generated with [AnimagineXL](https://huggingface.co/Linaqruf/animagine-xl))
LocalAI supports generating images with Stable diffusion, running on CPU using C++ and Python implementations.
## Usage
OpenAI docs: https://platform.openai.com/docs/api-reference/images/create
To generate an image you can send a POST request to the `/v1/images/generations` endpoint with the instruction as the request body:
```bash
# 512x512 is supported too
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "A cute baby sea otter",
"size": "256x256"
}'
```
Available additional parameters: `mode`, `step`.
Note: To set a negative prompt, you can split the prompt with `|`, for instance: `a cute baby sea otter|malformed`.
```bash
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
```
## Backends
### stablediffusion-cpp
| mode=0 | mode=1 (winograd/sgemm) |
|------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| ![test](https://github.com/go-skynet/LocalAI/assets/2420543/7145bdee-4134-45bb-84d4-f11cb08a5638) | ![b643343452981](https://github.com/go-skynet/LocalAI/assets/2420543/abf14de1-4f50-4715-aaa4-411d703a942a) |
| ![b6441997879](https://github.com/go-skynet/LocalAI/assets/2420543/d50af51c-51b7-4f39-b6c2-bf04c403894c) | ![winograd2](https://github.com/go-skynet/LocalAI/assets/2420543/1935a69a-ecce-4afc-a099-1ac28cb649b3) |
| ![winograd](https://github.com/go-skynet/LocalAI/assets/2420543/1979a8c4-a70d-4602-95ed-642f382f6c6a) | ![winograd3](https://github.com/go-skynet/LocalAI/assets/2420543/e6d184d4-5002-408f-b564-163986e1bdfb) |
Note: the image generator supports images up to 512x512. You can however use other tools to upscale the image, for instance: https://github.com/upscayl/upscayl.
#### Setup
Note: In order to use the `images/generation` endpoint with the `stablediffusion` C++ backend, you need to build LocalAI with `GO_TAGS=stablediffusion`. If you are using the container images, it is already enabled.
{{< tabs >}}
{{% tab name="Prepare the model in runtime" %}}
While the API is running, you can install the model by using the `/models/apply` endpoint and point it to the `stablediffusion` model in the [models-gallery](https://github.com/go-skynet/model-gallery#image-generation-stable-diffusion):
```bash
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/stablediffusion.yaml"
}'
```
{{% /tab %}}
{{% tab name="Automatically prepare the model before start" %}}
You can set the `PRELOAD_MODELS` environment variable:
```bash
PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:go-skynet/model-gallery/stablediffusion.yaml
```
{{% /tab %}}
{{% tab name="Install manually" %}}
1. Create a model file `stablediffusion.yaml` in the models folder:
```yaml
name: stablediffusion
backend: stablediffusion
parameters:
  model: stablediffusion_assets
```
2. Create a `stablediffusion_assets` directory inside your `models` directory
3. Download the ncnn assets from https://github.com/EdVince/Stable-Diffusion-NCNN#out-of-box and place them in `stablediffusion_assets`.
The models directory should look like the following:
```bash
models
├── stablediffusion_assets
│   ├── AutoencoderKL-256-256-fp16-opt.param
│   ├── AutoencoderKL-512-512-fp16-opt.param
│   ├── AutoencoderKL-base-fp16.param
│   ├── AutoencoderKL-encoder-512-512-fp16.bin
│   ├── AutoencoderKL-fp16.bin
│   ├── FrozenCLIPEmbedder-fp16.bin
│   ├── FrozenCLIPEmbedder-fp16.param
│   ├── log_sigmas.bin
│   ├── tmp-AutoencoderKL-encoder-256-256-fp16.param
│   ├── UNetModel-256-256-MHA-fp16-opt.param
│   ├── UNetModel-512-512-MHA-fp16-opt.param
│   ├── UNetModel-base-MHA-fp16.param
│   ├── UNetModel-MHA-fp16.bin
│   └── vocab.txt
└── stablediffusion.yaml
```
{{% /tab %}}
{{< /tabs >}}
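After the model is installed with any of the methods above, image generation can be tested by passing the `stablediffusion` model name explicitly (a sketch based on the generation request shown earlier):
```bash
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "A cute baby sea otter",
  "model": "stablediffusion",
  "size": "256x256"
}'
```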
### Diffusers
[Diffusers](https://huggingface.co/docs/diffusers/index) is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the `diffusers` library.
![anime_girl](https://github.com/go-skynet/LocalAI/assets/2420543/8aaca62a-e864-4011-98ae-dcc708103928)
(Generated with [AnimagineXL](https://huggingface.co/Linaqruf/animagine-xl))
#### Model setup
The models will be downloaded automatically from `huggingface` the first time you use the backend.
Create a model configuration file in the `models` directory, for instance to use `Linaqruf/animagine-xl` with CPU:
```yaml
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
diffusers:
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
```
#### Dependencies
This is an extra backend: it is already available in the container images and there is nothing to do for the setup. Do not use *core* images (ending with `-core`). If you are building manually, see the [build instructions]({{%relref "docs/getting-started/build" %}}).
To run the same model on GPU (CUDA), enable `cuda` and `f16` in the configuration:
```yaml
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
cuda: true
f16: true
diffusers:
  scheduler_type: euler_a
```
#### Local models
You can also use local models, or modify some parameters like `clip_skip`, `scheduler_type`, for instance:
```yaml
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  cfg_scale: 8
  clip_skip: 11
```
#### Configuration parameters
The following parameters are available in the configuration file:
| Parameter | Description | Default |
| --- | --- | --- |
| `f16` | Force the usage of `float16` instead of `float32` | `false` |
| `step` | Number of steps to run the model for | `30` |
| `cuda` | Enable CUDA acceleration | `false` |
| `enable_parameters` | Parameters to enable for the model | `negative_prompt,num_inference_steps,clip_skip` |
| `scheduler_type` | Scheduler type | `k_dpp_sde` |
| `cfg_scale` | Configuration scale | `8` |
| `clip_skip` | Clip skip | None |
| `pipeline_type` | Pipeline type | `AutoPipelineForText2Image` |
Several types of schedulers are available:
| Scheduler | Description |
| --- | --- |
| `ddim` | DDIM |
| `pndm` | PNDM |
| `heun` | Heun |
| `unipc` | UniPC |
| `euler` | Euler |
| `euler_a` | Euler a |
| `lms` | LMS |
| `k_lms` | LMS Karras |
| `dpm_2` | DPM2 |
| `k_dpm_2` | DPM2 Karras |
| `dpm_2_a` | DPM2 a |
| `k_dpm_2_a` | DPM2 a Karras |
| `dpmpp_2m` | DPM++ 2M |
| `k_dpmpp_2m` | DPM++ 2M Karras |
| `dpmpp_sde` | DPM++ SDE |
| `k_dpmpp_sde` | DPM++ SDE Karras |
| `dpmpp_2m_sde` | DPM++ 2M SDE |
| `k_dpmpp_2m_sde` | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
| --- | --- |
| `StableDiffusionPipeline` | Stable diffusion pipeline |
| `StableDiffusionImg2ImgPipeline` | Stable diffusion image to image pipeline |
| `StableDiffusionDepth2ImgPipeline` | Stable diffusion depth to image pipeline |
| `DiffusionPipeline` | Diffusion pipeline |
| `StableDiffusionXLPipeline` | Stable diffusion XL pipeline |
#### Usage
#### Text to Image
Use the `image` generation endpoint with the `model` name from the configuration file:
```bash
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "<positive prompt>|<negative prompt>",
"model": "animagine-xl",
"step": 51,
"size": "1024x1024"
}'
```
#### Image to Image
https://huggingface.co/docs/diffusers/using-diffusers/img2img
An example model (GPU):
```yaml
name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
```
```bash
IMAGE_PATH=/path/to/your/image
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
```
#### Depth to Image
https://huggingface.co/docs/diffusers/using-diffusers/depth2img
```yaml
name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
# GPU usage
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
  cfg_scale: 6
```
```bash
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
```
#### img2vid
```yaml
name: img2vid
parameters:
  model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
# GPU usage
f16: true
cuda: true
diffusers:
  pipeline_type: StableVideoDiffusionPipeline
```
```bash
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```
#### txt2vid
```yaml
name: txt2vid
parameters:
  model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
# GPU usage
f16: true
cuda: true
diffusers:
  pipeline_type: VideoDiffusionPipeline
  cuda: true
```
```bash
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```

@@ -0,0 +1,508 @@
+++
disableToc = false
title = "🖼️ Model gallery"
weight = 18
url = '/models'
+++
<h1 align="center">
<br>
<img height="300" src="https://github.com/go-skynet/model-gallery/assets/2420543/7a6a8183-6d0a-4dc4-8e1d-f2672fab354e"> <br>
<br>
</h1>
The model gallery is an (experimental!) collection of model configurations for [LocalAI](https://github.com/go-skynet/LocalAI).
To ease out installations of models, LocalAI provides a way to preload models on start, and to download and install them at runtime. You can install models manually by copying them over the `models` directory, or use the API to configure, download and verify the model assets for you. As the UI is still a work in progress, you will find here the documentation about the API endpoints.
{{% alert note %}}
The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
{{% /alert %}}
{{% alert note %}}
GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
{{% /alert %}}
## Useful Links and resources
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) - here you can find a list of the best performing models on the Open LLM benchmark. Keep in mind that models compatible with LocalAI must be quantized in the `gguf` format.
## Model repositories
You can install a model at runtime, while the API is running and already started, or before starting the API by preloading the models.
To install a model in runtime you will need to use the `/models/apply` LocalAI API endpoint.
To enable the `model-gallery` repository you need to start `local-ai` with the `GALLERIES` environment variable:
```
GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL>"}]
```
For example, to enable the `model-gallery` repository, start `local-ai` with:
```
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]
```
where `github:go-skynet/model-gallery/index.yaml` will be expanded automatically to `https://raw.githubusercontent.com/go-skynet/model-gallery/main/index.yaml`.
{{% alert note %}}
As this feature is experimental, you need to run `local-ai` with a list of `GALLERIES`. Currently there are two galleries:
- An official one, containing only definitions and models with a clear LICENSE to avoid any DMCA infringement. As I'm not sure what's the best action to do in this case, I'm not going to include any model that is not clearly licensed in this repository, which is officially linked to LocalAI.
- A "community" one that contains an index of `huggingface` models that are compatible with the `ggml` format and lives in the `localai-huggingface-zoo` repository.
To enable the two repositories, start `LocalAI` with the `GALLERIES` environment variable:
```bash
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
```
If running with `docker-compose`, simply edit the `.env` file and uncomment the `GALLERIES` variable, and add the one you want to use.
{{% /alert %}}
{{% alert note %}}
You might not find all the models in this gallery: automated CI updates the gallery automatically. You can however find most of the models on huggingface (https://huggingface.co/); generally they should be available `~24h` after upload.
Under no circumstances are LocalAI and its developers responsible for the models in this gallery, as CI is just indexing them and providing a convenient way to install with an automatic configuration with a consistent API. Don't install models from authors you don't trust, and check the appropriate license for your use case. Models are automatically indexed and hosted on huggingface (https://huggingface.co/). For any issue with the models, please open an issue on the model gallery repository if it's a LocalAI misconfiguration, otherwise refer to the huggingface repository. If you think a model should not be listed, please reach out to us and we will remove it from the gallery.
{{% /alert %}}
{{% alert note %}}
There is no documentation yet on how to build a gallery or a repository - but you can find an example in the [model-gallery](https://github.com/go-skynet/model-gallery) repository.
{{% /alert %}}
### List Models
To list all the available models, use the `/models/available` endpoint:
```bash
curl http://localhost:8080/models/available
```
To search for a model, you can use `jq`:
```bash
# Get all information about models with a name that contains "replit"
curl http://localhost:8080/models/available | jq '.[] | select(.name | contains("replit"))'
# Get the binary name of all local models (not hosted on Hugging Face)
curl http://localhost:8080/models/available | jq '.[] | .name | select(contains("localmodels"))'
# Get all of the model URLs that contains "orca"
curl http://localhost:8080/models/available | jq '.[] | .urls | select(. != null) | add | select(contains("orca"))'
```
### How to install a model from the repositories
Models can be installed by passing the full URL of the YAML config file, or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.
To install a model from the gallery repository, you can pass the model name in the `id` field. For instance, to install the `bert-embeddings` model, you can use the following command:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "model-gallery@bert-embeddings"
}'
```
where:
- `model-gallery` is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in more than one gallery, the first match wins.
- `bert-embeddings` is the model name in the gallery
(read its [config here](https://github.com/go-skynet/model-gallery/blob/main/bert-embeddings.yaml)).
{{% alert note %}}
If the `huggingface` model gallery is enabled (it's enabled by default),
and the model has an entry in the model gallery's associated YAML config
(for `huggingface`, see [`model-gallery/huggingface.yaml`](https://github.com/go-skynet/model-gallery/blob/main/huggingface.yaml)),
you can install models by specifying directly the model's `id`.
For example, to install wizardlm superhot:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "huggingface@TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GGML/wizardlm-13b-v1.0-superhot-8k.ggmlv3.q4_K_M.bin"
}'
```
Note that the `id` can be used similarly when pre-loading models at start.
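For instance, a hypothetical preload entry reusing the same `id` (the `PRELOAD_MODELS` variable is described further below) could be:
```bash
PRELOAD_MODELS='[{"id": "huggingface@TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GGML/wizardlm-13b-v1.0-superhot-8k.ggmlv3.q4_K_M.bin"}]'
```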
{{% /alert %}}
## How to install a model (without a gallery)
If you don't want to set any gallery repository, you can still install models by loading a model configuration file.
In the body of the request you must specify the model configuration file URL (`url`), optionally a name to install the model (`name`), extra files to install (`files`), and configuration overrides (`overrides`). When calling the API endpoint, LocalAI will download the model files and write the configuration to the folder used to store models.
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>"
}'
# or if from a repository
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "<GALLERY>@<MODEL_NAME>"
}'
```
An example that installs openllama can be:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "https://github.com/go-skynet/model-gallery/blob/main/openllama_3b.yaml"
}'
```
The API will return a job `uuid` that you can use to track the job progress:
```
{"uuid":"1059474d-f4f9-11ed-8d99-c4cbe106d571","status":"http://localhost:8080/models/jobs/1059474d-f4f9-11ed-8d99-c4cbe106d571"}
```
For instance, a small example bash script that waits for a job to complete can be (requires `jq`):
```bash
model_url="<MODEL_CONFIG_FILE_URL>" # e.g. a github: or https:// URL to a model YAML
response=$(curl -s http://localhost:8080/models/apply -H "Content-Type: application/json" -d "{\"url\": \"$model_url\"}")
job_id=$(echo "$response" | jq -r '.uuid')
while [ "$(curl -s http://localhost:8080/models/jobs/"$job_id" | jq -r '.processed')" != "true" ]; do
sleep 1
done
echo "Job completed"
```
To preload models on start instead you can use the `PRELOAD_MODELS` environment variable.
<details>
To preload models on start, use the `PRELOAD_MODELS` environment variable by setting it to a JSON array of model URIs:
```bash
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'
```
Note: either `url` or `id` must be specified. `url` points to a model gallery configuration file, while `id` refers to a model inside a gallery repository. If both are specified, the `id` takes precedence.
For example:
```bash
PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:go-skynet/model-gallery/stablediffusion.yaml
```
</details>
{{% alert note %}}
You can find already some open licensed models in the [model gallery](https://github.com/go-skynet/model-gallery).
If you don't find the model in the gallery you can try to use the "base" model and provide a URL to LocalAI:
<details>
```
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/base.yaml",
"name": "model-name",
"files": [
{
"uri": "<URL>",
"sha256": "<SHA>",
"filename": "model"
}
]
}'
```
</details>
{{% /alert %}}
## Installing a model with a different name
To install a model with a different name, specify a `name` parameter in the request body.
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>",
"name": "<MODEL_NAME>"
}'
```
For example, to install a model as `gpt-3.5-turbo`:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/gpt4all-j.yaml",
"name": "gpt-3.5-turbo"
}'
```
## Additional Files
<details>
To download additional files with the model, use the `files` parameter:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>",
"name": "<MODEL_NAME>",
"files": [
{
"uri": "<additional_file_url>",
"sha256": "<additional_file_hash>",
"filename": "<additional_file_name>"
}
]
}'
```
</details>
## Overriding configuration files
<details>
To override portions of the configuration file, such as the backend or the model file, use the `overrides` parameter:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>",
"name": "<MODEL_NAME>",
"overrides": {
"backend": "llama",
"f16": true,
...
}
}'
```
</details>
## Examples
### Embeddings: Bert
<details>
```bash
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/bert-embeddings.yaml",
"name": "text-embedding-ada-002"
}'
```
To test it:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/v1/embeddings -H "Content-Type: application/json" -d '{
"input": "Test",
"model": "text-embedding-ada-002"
}'
```
</details>
### Image generation: Stable diffusion
URL: https://github.com/EdVince/Stable-Diffusion-NCNN
{{< tabs >}}
{{% tab name="Prepare the model in runtime" %}}
While the API is running, you can install the model by using the `/models/apply` endpoint and point it to the `stablediffusion` model in the [models-gallery](https://github.com/go-skynet/model-gallery#image-generation-stable-diffusion):
```bash
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/stablediffusion.yaml"
}'
```
{{% /tab %}}
{{% tab name="Automatically prepare the model before start" %}}
You can set the `PRELOAD_MODELS` environment variable:
```bash
PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/stablediffusion.yaml"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:go-skynet/model-gallery/stablediffusion.yaml
```
{{% /tab %}}
{{< /tabs >}}
Test it:
```
curl $LOCALAI/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"mode": 2, "seed":9000,
"size": "256x256", "n":2
}'
```
### Audio transcription: Whisper
URL: https://github.com/ggerganov/whisper.cpp
{{< tabs >}}
{{% tab name="Prepare the model in runtime" %}}
```bash
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/whisper-base.yaml",
"name": "whisper-1"
}'
```
{{% /tab %}}
{{% tab name="Automatically prepare the model before start" %}}
You can set the `PRELOAD_MODELS` environment variable:
```bash
PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/whisper-base.yaml", "name": "whisper-1"}]'
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:go-skynet/model-gallery/whisper-base.yaml", "name": "whisper-1"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:go-skynet/model-gallery/whisper-base.yaml
name: whisper-1
```
{{% /tab %}}
{{< /tabs >}}
### GPTs
<details>
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:go-skynet/model-gallery/gpt4all-j.yaml",
"name": "gpt4all-j"
}'
```
To test it:
```
curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt4all-j",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
```
</details>
### Note
LocalAI will create a batch process that downloads the required files from a model definition and automatically reloads itself to include the new model.
Input: `url` or `id` (required), `name` (optional), `files` (optional)
```bash
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_DEFINITION_URL>",
     "id": "<GALLERY>@<MODEL_NAME>",
     "name": "<INSTALLED_MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ],
     "overrides": { "backend": "...", "f16": true }
   }'
```
An optional list of additional files to download can be specified within `files`. The `name` field allows overriding the model name. Finally, it is possible to override fields of the model config file with `overrides`.
The `url` is a full URL, or a github url (`github:org/repo/file.yaml`), or a local file (`file:///path/to/file.yaml`).
The `id` is a string in the form `<GALLERY>@<MODEL_NAME>`, where `<GALLERY>` is the name of the gallery, and `<MODEL_NAME>` is the name of the model in the gallery. Galleries can be specified during startup with the `GALLERIES` environment variable.
Returns a `uuid` and a `url` to follow up on the state of the process:
```json
{ "uuid":"251475c9-f666-11ed-95e0-9a8a4480ac58", "status":"http://localhost:8080/models/jobs/251475c9-f666-11ed-95e0-9a8a4480ac58"}
```
To see a collection example of curated models definition files, see the [model-gallery](https://github.com/go-skynet/model-gallery).
#### Get model job state `/models/jobs/<uid>`
This endpoint returns the state of the batch job associated to a model installation.
```bash
curl http://localhost:8080/models/jobs/<JOB_ID>
```
Returns a JSON object containing the error (if any) and whether the job has been processed:
```json
{"error":null,"processed":true,"message":"completed"}
```

View file

@ -0,0 +1,126 @@
+++
disableToc = false
title = "🔥 OpenAI functions"
weight = 17
+++
LocalAI supports running OpenAI functions with `llama.cpp` compatible models.
![localai-functions-1](https://github.com/ggerganov/llama.cpp/assets/2420543/5bd15da2-78c1-4625-be90-1e938e6823f1)
To learn more about OpenAI functions, see the [OpenAI API blog post](https://openai.com/blog/function-calling-and-other-api-updates).
💡 Check out also [LocalAGI](https://github.com/mudler/LocalAGI) for an example on how to use LocalAI functions.
## Setup
OpenAI functions are available only with `ggml` or `gguf` models compatible with `llama.cpp`.
You don't need to do anything specific - just use `ggml` or `gguf` models.
## Usage example
You can configure a model manually with a YAML config file in the models directory, for example:
```yaml
name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1
```
To use the functions with the OpenAI client in python:
```python
import openai
# ...
# Send the conversation and available functions to GPT
messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
functions = [
{
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
}
]
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=messages,
functions=functions,
function_call="auto",
)
# ...
```
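As a minimal sketch (assuming the standard OpenAI-compatible response shape returned by the client above), the function call chosen by the model can then be read back like this:
```python
# Illustrative only: read the function call selected by the model from the response
message = response["choices"][0]["message"]
if message.get("function_call"):
    print(message["function_call"]["name"])       # e.g. "get_current_weather"
    print(message["function_call"]["arguments"])  # JSON-encoded string with the arguments
```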
{{% alert note %}}
When running the python script, be sure to:
- Set `OPENAI_API_KEY` environment variable to a random string (the OpenAI api key is NOT required!)
- Set `OPENAI_API_BASE` to point to your LocalAI service, for example `OPENAI_API_BASE=http://localhost:8080`
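For example (the key value below is illustrative; any non-empty string works):
```bash
export OPENAI_API_KEY=sk-anything
export OPENAI_API_BASE=http://localhost:8080
```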
{{% /alert %}}
## Advanced
It is also possible to specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the additional `grammar_json_functions` parameter, which takes a JSON Schema object.
For example, with curl:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1,
"grammar_json_functions": {
"oneOf": [
{
"type": "object",
"properties": {
"function": {"const": "create_event"},
"arguments": {
"type": "object",
"properties": {
"title": {"type": "string"},
"date": {"type": "string"},
"time": {"type": "string"}
}
}
}
},
{
"type": "object",
"properties": {
"function": {"const": "search"},
"arguments": {
"type": "object",
"properties": {
"query": {"type": "string"}
}
}
}
}
]
}
}'
```
## 💡 Examples
A full e2e example with `docker-compose` is available [here](https://github.com/go-skynet/LocalAI/tree/master/examples/functions).

View file

@ -0,0 +1,263 @@
+++
disableToc = false
title = "📖 Text generation (GPT)"
weight = 10
+++
LocalAI supports text generation with GPT-style models using `llama.cpp` and other backends (such as `rwkv.cpp`); see also the [Model compatibility]({{%relref "docs/reference/compatibility-table" %}}) table for an up-to-date list of the supported model families.
Note:
- You can also specify the model name as part of the OpenAI token.
- If only one model is available, the API will use it for all the requests.
## API Reference
### Chat completions
https://platform.openai.com/docs/api-reference/chat
For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`.
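For instance, a sketch of the same request with those additional parameters set (the values are illustrative):
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7, "top_p": 0.9, "top_k": 40, "max_tokens": 64
}'
```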
### Edit completions
https://platform.openai.com/docs/api-reference/edits
To generate an edit completion you can send a POST request to the `/v1/edits` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"instruction": "rephrase",
"input": "Black cat jumped out of the window",
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`.
### Completions
https://platform.openai.com/docs/api-reference/completions
To generate a completion, you can send a POST request to the `/v1/completions` endpoint with the instruction as per the request body:
```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
### List models
You can list all the models available with:
```bash
curl http://localhost:8080/v1/models
```
## Backends
### AutoGPTQ
[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
#### Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install [AutoGPTQ manually](https://github.com/PanQiWei/AutoGPTQ#quick-installation).
#### Model setup
The models are automatically downloaded from `huggingface` the first time they are used, if not already present. Models can be defined via a `YAML` config file, or just by querying the endpoint with the `huggingface` repository model name. For example, create a `YAML` config file in `models/`:
```
name: orca
backend: autogptq
model_base_name: "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
parameters:
model: "TheBloke/orca_mini_v2_13b-GPTQ"
# ...
```
Test with:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "orca",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
```
### RWKV
A full example on how to run a rwkv model is in the [examples](https://github.com/go-skynet/LocalAI/tree/master/examples/rwkv).
Note: rwkv models need to specify the `rwkv` backend in their YAML config file and must ship with an associated tokenizer that is provided alongside the model, as in the listing below:
```
36464540 -rw-r--r-- 1 mudler mudler 1.2G May 3 10:51 rwkv_small
36464543 -rw-r--r-- 1 mudler mudler 2.4M May 3 10:51 rwkv_small.tokenizer.json
```
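A minimal, illustrative YAML config for the files above could look like this (the tokenizer file is expected next to the model, following the naming shown above):
```yaml
name: rwkv
backend: rwkv
parameters:
  # Relative to the models path; rwkv_small.tokenizer.json sits next to it
  model: rwkv_small
```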
### llama.cpp
[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.
{{% alert note %}}
The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, specify the `llama-ggml` backend instead. If you are relying on automatic detection of the model, you should be fine. For `gguf` models, use the `llama` backend. The Go backend is deprecated as well, but is still available as `go-llama`; it still supports features not available in the mainline backend, such as speculative sampling and embeddings.
{{% /alert %}}
#### Features
The `llama.cpp` model supports the following features:
- [📖 Text generation (GPT)]({{%relref "docs/features/text-generation" %}})
- [🧠 Embeddings]({{%relref "docs/features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "docs/features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "docs/features/constrained_grammars" %}})
#### Setup
LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model.
##### Manual setup
It is sufficient to copy the `ggml` or `gguf` model files in the `models` folder. You can refer to the model in the `model` parameter in the API calls.
[You can optionally create an associated YAML]({{%relref "docs/advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
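As a minimal sketch (model file name and template content are illustrative), such a config could look like:
```yaml
name: my-model
parameters:
  # Relative to the models path
  model: my-model.gguf
template:
  chat: |
    Instruct: {{.Input}}
    Output:
```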
##### Automatic setup
LocalAI supports model galleries, which are indexes of models. For instance, the huggingface gallery contains a large curated index of `ggml` and `gguf` models from the Hugging Face model hub.
For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
```
LocalAI will automatically download and configure the model in the `models` directory.
Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "docs/features/model-gallery" %}}).
#### YAML configuration
To use the `llama.cpp` backend, specify `llama` as the backend in the YAML file:
```yaml
name: llama
backend: llama
parameters:
# Relative to the models path
model: file.gguf.bin
```
In the example above we specify `llama` as the backend to restrict loading to `gguf` models only.
For instance, to use the `llama-ggml` backend for `ggml` models:
```yaml
name: llama
backend: llama-ggml
parameters:
# Relative to the models path
model: file.ggml.bin
```
#### Reference
- [llama](https://github.com/ggerganov/llama.cpp)
- [binding](https://github.com/go-skynet/go-llama.cpp)
### exllama/2
[Exllama](https://github.com/turboderp/exllama) is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". Both `exllama` and `exllama2` are supported.
#### Model setup
Download the model as a folder inside the `models` directory and create a YAML file specifying the `exllama` backend. For instance with the `TheBloke/WizardLM-7B-uncensored-GPTQ` model:
```
$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/
.keep WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml
name: exllama
parameters:
model: WizardLM-7B-uncensored-GPTQ
backend: exllama
# Note: you can also specify "exllama2" if it's an exllama2 model here
# ...
```
Test with:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "exllama",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
```
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.
LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out `vllm` performance [here](https://github.com/vllm-project/vllm#performance).
#### Setup
Create a YAML file for the model you want to use with `vllm`.
To set up a model, you just need to specify the model name in the YAML config file:
```yaml
name: vllm
backend: vllm
parameters:
model: "facebook/opt-125m"
# Uncomment to specify a quantization method (optional)
# quantization: "awq"
```
The backend will automatically download the required files in order to run the model.
#### Usage
Use the `completions` endpoint by specifying the `vllm` backend:
```
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
```

View file

@ -0,0 +1,158 @@
+++
disableToc = false
title = "🗣 Text to audio (TTS)"
weight = 11
+++
The `/tts` endpoint can be used to generate speech from text.
## Usage
Input: `input`, `model`
For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts"
}'
```
Returns an `audio/wav` file.
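For example, to save the generated speech to a file instead of piping it to a player (the file name is illustrative):
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
     "input": "Hello world",
     "model": "tts"
}' --output hello.wav
```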
## Backends
### 🐸 Coqui
Required: Don't use `LocalAI` images ending with the `-core` tag. Python dependencies are required in order to use this backend.
Coqui works without any configuration; to test it, you can run the following curl command:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "coqui",
"model": "tts_models/en/ljspeech/glow-tts",
"input":"Hello, this is a test!"
}'
```
### Bark
[Bark](https://github.com/suno-ai/bark) allows generating audio from text prompts.
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
#### Model setup
There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
#### Usage
Use the `tts` endpoint by specifying the `bark` backend:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!"
}' | aplay
```
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the `model` parameter:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!",
"model": "v2/en_speaker_4"
}' | aplay
```
### Piper
To install the `piper` audio models manually:
- Download Voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
- Extract the `.tar.gz` archives (`.onnx`, `.json`) inside `models`
- Run the following command to test the model is working
To use the tts endpoint, run the following command. You can specify a backend with the `backend` parameter. For example, to use the `piper` backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model":"it-riccardo_fasol-x-low.onnx",
"backend": "piper",
"input": "Ciao, sono Ettore"
}' | aplay
```
Note:
- `aplay` is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the `GO_TAGS=tts` flag.
### Transformers-musicgen
LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
```
curl --request POST \
--url http://localhost:8080/tts \
--header 'Content-Type: application/json' \
--data '{
"backend": "transformers-musicgen",
"model": "facebook/musicgen-medium",
"input": "Cello Rave"
}' | aplay
```
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
### Vall-E-X
[VALL-E-X](https://github.com/Plachtaa/VALL-E-X) is an open source implementation of Microsoft's VALL-E X zero-shot TTS model.
#### Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container images and there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.
#### Usage
Use the tts endpoint by specifying the vall-e-x backend:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"input":"Hello!"
}' | aplay
```
#### Voice cloning
In order to use voice cloning capabilities you must create a `YAML` configuration file to set up a model:
```yaml
name: cloned-voice
backend: vall-e-x
parameters:
model: "cloned-voice"
vall-e:
# The path to the audio file to be cloned
# relative to the models directory
audio_path: "path-to-wav-source.wav"
```
Then you can specify the model name in the requests:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"model": "cloned-voice",
"input":"Hello!"
}' | aplay
```

View file

@ -0,0 +1,7 @@
+++
disableToc = false
title = "Getting started"
weight = 2
icon = "rocket_launch"
+++

View file

@ -0,0 +1,261 @@
+++
disableToc = false
title = "Build LocalAI from source"
weight = 6
url = '/basics/build/'
ico = "rocket_launch"
+++
### Build
LocalAI can be built as a container image or as a single, portable binary. Note that some model architectures might require Python libraries, which are not included in the binary; the binary contains only the core backends written in Go and C++.
LocalAI's extensible architecture allows you to add your own backends, which can be written in any language; as such, the container images also contain the Python dependencies needed to run all the available backends (for example, backends like __Diffusers__ that allow generating images and videos from text).
In some cases you might want to re-build LocalAI from source (for instance to leverage Apple Silicon acceleration), or to build a custom container image with your own backends. This section contains instructions on how to build LocalAI from source.
#### Container image
Requirements:
- Docker or podman, or a container engine
In order to build the `LocalAI` container image locally you can use `docker`, for example:
```
# build the image
docker build -t localai .
docker run localai
```
#### Build LocalAI locally
##### Requirements
In order to build LocalAI locally, you need the following requirements:
- Golang >= 1.21
- Cmake/make
- GCC
- GRPC
To install the dependencies follow the instructions below:
{{< tabs tabTotal="3" >}}
{{% tab tabName="Apple" %}}
```bash
brew install abseil cmake go grpc protobuf wget
```
{{% /tab %}}
{{% tab tabName="Debian" %}}
```bash
apt install golang protobuf-compiler-grpc libgrpc-dev make cmake
```
{{% /tab %}}
{{% tab tabName="From source" %}}
Specify `BUILD_GRPC_FOR_BACKEND_LLAMA=true` to automatically build the gRPC dependencies:
```bash
make ... BUILD_GRPC_FOR_BACKEND_LLAMA=true build
```
{{% /tab %}}
{{< /tabs >}}
##### Build
To build LocalAI with `make`:
```
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
```
This should produce the `local-ai` binary.
{{% alert note %}}
#### CPU flagset compatibility
LocalAI uses different backends based on ggml and llama.cpp to run models. If your CPU doesn't support common instruction sets, you can disable them during build:
```
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" make build
```
To have effect on the container image, you need to set `REBUILD=true`:
```
docker run quay.io/go-skynet/localai
docker run --rm -ti -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -e REBUILD=true -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" -v $PWD/models:/models quay.io/go-skynet/local-ai:latest
```
{{% /alert %}}
### Example: Build on mac
Building on Mac (M1 or M2) works, but you may need to install some prerequisites using `brew`.
The below has been tested by one mac user and found to work. Note that this doesn't use Docker to run the server:
```
# install build dependencies
brew install abseil cmake go grpc protobuf wget
# clone the repo
git clone https://github.com/go-skynet/LocalAI.git
cd LocalAI
# build the binary
make build
# Download gpt4all-j to models/
wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
# Use a template from the examples
cp -rf prompt-templates/ggml-gpt4all-j.tmpl models/
# Run LocalAI
./local-ai --models-path=./models/ --debug=true
# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-gpt4all-j",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
```
### Build with Image generation support
**Requirements**: OpenCV, Gomp
Image generation requires `GO_TAGS=stablediffusion` or `GO_TAGS=tinydream` to be set during build:
```
make GO_TAGS=stablediffusion build
```
### Build with Text to audio support
**Requirements**: piper-phonemize
Text to audio support is experimental and requires `GO_TAGS=tts` to be set during build:
```
make GO_TAGS=tts build
```
### Acceleration
List of the variables available to customize the build:
| Variable | Default | Description |
| ---------------------| ------- | ----------- |
| `BUILD_TYPE` | None | Build type. Available: `cublas`, `openblas`, `clblas`, `metal`,`hipblas` |
| `GO_TAGS` | `tts stablediffusion` | Go tags. Available: `stablediffusion`, `tts`, `tinydream` |
| `CLBLAST_DIR` | | Specify a CLBlast directory |
| `CUDA_LIBPATH` | | Specify a CUDA library path |
#### OpenBLAS
Software acceleration.
Requirements: OpenBLAS
```
make BUILD_TYPE=openblas build
```
#### CuBLAS
Nvidia Acceleration.
Requirement: Nvidia CUDA toolkit
Note: CuBLAS support is experimental and has not been tested on real hardware. Please report any issues you find!
```
make BUILD_TYPE=cublas build
```
More information is available in the upstream PR: https://github.com/ggerganov/llama.cpp/pull/1412
#### Hipblas (AMD GPU with ROCm on Arch Linux)
Packages:
```
pacman -S base-devel git rocm-hip-sdk rocm-opencl-sdk opencv clblast grpc
```
Library links:
```
export CGO_CFLAGS="-I/usr/include/opencv4"
export CGO_CXXFLAGS="-I/usr/include/opencv4"
export CGO_LDFLAGS="-L/opt/rocm/hip/lib -lamdhip64 -L/opt/rocm/lib -lOpenCL -L/usr/lib -lclblast -lrocblas -lhipblas -lrocrand -lomp -O3 --rtlib=compiler-rt -unwindlib=libgcc -lhipblas -lrocblas --hip-link"
```
Build:
```
make BUILD_TYPE=hipblas GPU_TARGETS=gfx1030
```
#### ClBLAS
AMD/Intel GPU acceleration.
Requirement: OpenCL, CLBlast
```
make BUILD_TYPE=clblas build
```
To specify a CLBlast directory, set `CLBLAST_DIR`.
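A hypothetical invocation (the path is illustrative) could be:
```
make BUILD_TYPE=clblas CLBLAST_DIR=/some/path/to/CLBlast build
```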
#### Metal (Apple Silicon)
```
make BUILD_TYPE=metal build
# Set `gpu_layers: 1` and `f16: true` in your YAML model config file (see the sketch below)
# Note: only models quantized with q4_0 are supported!
```
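A sketch of a corresponding model config (names are illustrative; `f16` and `gpu_layers` follow the note above):
```yaml
name: my-metal-model
backend: llama
f16: true
gpu_layers: 1
parameters:
  # Relative to the models path; q4_0 quantization per the note above
  model: my-model-q4_0.gguf
```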
### Windows compatibility
Make sure to give enough resources to the running container. See https://github.com/go-skynet/LocalAI/issues/2
### Examples
More advanced build options are available, for instance to build only a single backend.
#### Build only a single backend
You can control the backends that are built by setting the `GRPC_BACKENDS` environment variable. For instance, to build only the `llama-cpp` backend:
```bash
make GRPC_BACKENDS=backend-assets/grpc/llama-cpp build
```
By default, all the backends are built.
#### Specific llama.cpp version
To build with a specific version of llama.cpp, set `CPPLLAMA_VERSION` to the desired tag or commit SHA:
```
CPPLLAMA_VERSION=<sha> make build
```

View file

@ -0,0 +1,71 @@
+++
disableToc = false
title = "Customizing the Model"
weight = 4
icon = "rocket_launch"
+++
To customize the prompt template or the default settings of the model, a configuration file is utilized. This file must adhere to the LocalAI YAML configuration standards. For comprehensive syntax details, refer to the [advanced documentation]({{%relref "docs/advanced" %}}). The configuration file can be located either remotely (such as in a Github Gist or at any other remote URL) or within the local filesystem.
LocalAI can be initiated using either its container image or binary, with a command that includes URLs of model config files or utilizes a shorthand format (like `huggingface://` or `github://`), which is then expanded into complete URLs.
The configuration can also be set via an environment variable. For instance:
```
# Command-Line Arguments
local-ai github://owner/repo/file.yaml@branch
# Environment Variable
MODELS="github://owner/repo/file.yaml@branch,github://owner/repo/file.yaml@branch" local-ai
```
Here's an example to initiate the **phi-2** model:
```bash
docker run -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core https://gist.githubusercontent.com/mudler/ad601a0488b497b69ec549150d9edd18/raw/a8a8869ef1bb7e3830bf5c0bae29a0cce991ff8d/phi-2.yaml
```
{{% alert icon="" %}}
The model configurations used in the quickstart are accessible here: [https://github.com/mudler/LocalAI/tree/master/embedded/models](https://github.com/mudler/LocalAI/tree/master/embedded/models). Contributions are welcome; please feel free to submit a Pull Request.
The `phi-2` model configuration from the quickstart is expanded from [https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml](https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml).
{{% /alert %}}
## Example: Customizing the Prompt Template
To modify the prompt template, create a Github gist or a Pastebin file, and copy the content from [https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml](https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml). Alter the fields as needed:
```yaml
name: phi-2
context_size: 2048
f16: true
threads: 11
gpu_layers: 90
mmap: true
parameters:
# Reference any HF model or a local file here
model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
temperature: 0.2
top_k: 40
top_p: 0.95
template:
chat: &template |
Instruct: {{.Input}}
Output:
# Modify the prompt template here ^^^ as per your requirements
completion: *template
```
Then, launch LocalAI using your gist's URL:
```bash
## Important! Substitute with your gist's URL!
docker run -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core https://gist.githubusercontent.com/xxxx/phi-2.yaml
```
## Next Steps
- Visit the [advanced section]({{%relref "docs/advanced" %}}) for more insights on prompt templates and configuration files.
- To learn about fine-tuning an LLM model, check out the [fine-tuning section]({{%relref "docs/advanced/fine-tuning" %}}).

View file

@ -0,0 +1,150 @@
+++
disableToc = false
title = "Run models manually"
weight = 5
icon = "rocket_launch"
+++
1. Ensure you have a model file, a configuration YAML file, or both. Customize model defaults and specific settings with a configuration file. For advanced configurations, refer to the [Advanced Documentation]({{%relref "docs/advanced" %}}).
2. For GPU Acceleration instructions, visit [GPU acceleration]({{%relref "docs/features/gpu-acceleration" %}}).
{{< tabs tabTotal="5" >}}
{{% tab tabName="Docker" %}}
```bash
# Prepare the `models` directory
mkdir models
# copy your models to it
cp your-model.gguf models/
# run the LocalAI container
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
# You should see:
#
# ┌───────────────────────────────────────────────────┐
# │ Fiber v2.42.0 │
# │ http://127.0.0.1:8080 │
# │ (bound on host 0.0.0.0 and port 8080) │
# │ │
# │ Handlers ............. 1 Processes ........... 1 │
# │ Prefork ....... Disabled PID ................. 1 │
# └───────────────────────────────────────────────────┘
# Try the endpoint with curl
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
```
{{% alert note %}}
- If running on Apple Silicon (ARM) it is **not** suggested to run on Docker due to emulation. Follow the [build instructions]({{%relref "docs/getting-started/build" %}}) to use Metal acceleration for full GPU support.
- If you are running Apple x86_64 you can use `docker`, there is no additional gain into building it from source.
{{% /alert %}}
{{% /tab %}}
{{% tab tabName="Docker compose" %}}
```bash
# Clone LocalAI
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>
# copy your models to models/
cp your-model.gguf models/
# (optional) Edit the .env file to set things like context size and threads
# vim .env
# start with docker compose
docker compose up -d --pull always
# or you can build the images with:
# docker compose up -d --build
# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.gguf","object":"model"}]}
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
```
Note: If you are on Windows, please make sure the project is on the Linux Filesystem, otherwise loading models might be slow. For more Info: [Microsoft Docs](https://learn.microsoft.com/en-us/windows/wsl/filesystems)
{{% /tab %}}
{{% tab tabName="Kubernetes" %}}
For installing LocalAI in Kubernetes, you can use the following helm chart:
```bash
# Install the helm repository
helm repo add go-skynet https://go-skynet.github.io/helm-charts/
# Update the repositories
helm repo update
# Get the values
helm show values go-skynet/local-ai > values.yaml
# Edit the values value if needed
# vim values.yaml ...
# Install the helm chart
helm install local-ai go-skynet/local-ai -f values.yaml
```
{{% /tab %}}
{{% tab tabName="From binary" %}}
LocalAI binary releases are available in [Github](https://github.com/go-skynet/LocalAI/releases).
{{% /tab %}}
{{% tab tabName="From source" %}}
See the [build section]({{%relref "docs/getting-started/build" %}}).
{{% /tab %}}
{{< /tabs >}}
### Example (Docker)
```bash
mkdir models
# Download luna-ai-llama2 to models/
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2
# Use a template from the examples
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "luna-ai-llama2",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}
```
For more model configurations, visit the [Examples Section](https://github.com/mudler/LocalAI/tree/master/examples/configurations).

View file

@ -0,0 +1,187 @@
+++
disableToc = false
title = "Quickstart"
weight = 3
url = '/basics/getting_started/'
icon = "rocket_launch"
+++
**LocalAI** is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that's compatible with the OpenAI API specifications for local inferencing. It allows you to run [LLMs]({{%relref "docs/features/text-generation" %}}), generate images and audio (and more) locally or on-prem with consumer grade hardware, supporting multiple model families and architectures.
## Installation Methods
LocalAI is available as a container image and binary, compatible with various container engines like Docker, Podman, and Kubernetes. Container images are published on [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest) and [Dockerhub](https://hub.docker.com/r/localai/localai). Binaries can be downloaded from [GitHub](https://github.com/mudler/LocalAI/releases).
{{% alert icon="💡" %}}
**Hardware Requirements:** The hardware requirements for LocalAI vary based on the model size and quantization method used. For performance benchmarks with different backends, such as `llama.cpp`, visit [this link](https://github.com/ggerganov/llama.cpp#memorydisk-requirements). The `rwkv` backend is noted for its lower resource consumption.
{{% /alert %}}
## Prerequisites
Before you begin, ensure you have a container engine installed if you are not using the binaries. Suitable options include Docker or Podman. For installation instructions, refer to the following guides:
- [Install Docker Desktop (Mac, Windows, Linux)](https://docs.docker.com/get-docker/)
- [Install Podman (Linux)](https://podman.io/getting-started/installation)
- [Install Docker engine (Servers)](https://docs.docker.com/engine/install/#get-started)
## Running Models
> _Do you have already a model file? Skip to [Run models manually]({{%relref "docs/getting-started/manual" %}})_.
LocalAI allows one-click runs with popular models. It downloads the model and starts the API with the model loaded.
There are different categories of models: [LLMs]({{%relref "docs/features/text-generation" %}}), [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) , [Embeddings]({{%relref "docs/features/embeddings" %}}), [Audio to Text]({{%relref "docs/features/audio-to-text" %}}), and [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) depending on the backend being used and the model architecture.
{{% alert icon="💡" %}}
To customize the models, see [Model customization]({{%relref "docs/getting-started/customize-model" %}}). For more model configurations, visit the [Examples Section](https://github.com/mudler/LocalAI/tree/master/examples/configurations).
{{% /alert %}}
{{< tabs tabTotal="3" >}}
{{% tab tabName="CPU-only" %}}
> 💡 Don't need GPU acceleration? Use the CPU images, which are lighter and do not have Nvidia dependencies.
| Model | Category | Docker command |
| --- | --- | --- |
| [phi-2](https://huggingface.co/microsoft/phi-2) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core phi-2``` |
| [llava](https://github.com/SkunkworksAI/BakLLaVA) | [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core llava``` |
| [mistral-openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core mistral-openorca``` |
| [bert-cpp](https://github.com/skeskinen/bert.cpp) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core bert-cpp``` |
| [all-minilm-l6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg all-minilm-l6-v2``` |
| whisper-base | [Audio to Text]({{%relref "docs/features/audio-to-text" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core whisper-base``` |
| rhasspy-voice-en-us-amy | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core rhasspy-voice-en-us-amy``` |
| coqui | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg coqui``` |
| bark | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg bark``` |
| vall-e-x | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg vall-e-x``` |
| mixtral-instruct Mixtral-8x7B-Instruct-v0.1 | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core mixtral-instruct``` |
| [tinyllama-chat](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF) [original model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core tinyllama-chat``` |
| [dolphin-2.5-mixtral-8x7b](https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core dolphin-2.5-mixtral-8x7b``` |
{{% /tab %}}
{{% tab tabName="GPU (CUDA 11)" %}}
> To check which CUDA version you have available, run `nvidia-smi` or `nvcc --version`; see also [GPU acceleration]({{%relref "docs/features/gpu-acceleration" %}}).
| Model | Category | Docker command |
| --- | --- | --- |
| [phi-2](https://huggingface.co/microsoft/phi-2) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core phi-2``` |
| [llava](https://github.com/SkunkworksAI/BakLLaVA) | [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core llava``` |
| [mistral-openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core mistral-openorca``` |
| [bert-cpp](https://github.com/skeskinen/bert.cpp) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core bert-cpp``` |
| [all-minilm-l6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 all-minilm-l6-v2``` |
| whisper-base | [Audio to Text]({{%relref "docs/features/audio-to-text" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core whisper-base``` |
| rhasspy-voice-en-us-amy | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core rhasspy-voice-en-us-amy``` |
| coqui | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 coqui``` |
| bark | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 bark``` |
| vall-e-x | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 vall-e-x``` |
| mixtral-instruct Mixtral-8x7B-Instruct-v0.1 | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core mixtral-instruct``` |
| [tinyllama-chat](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF) [original model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core tinyllama-chat``` |
| [dolphin-2.5-mixtral-8x7b](https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core dolphin-2.5-mixtral-8x7b``` |
{{% /tab %}}
{{% tab tabName="GPU (CUDA 12)" %}}
> To check which CUDA version you have available, run `nvidia-smi` or `nvcc --version`; see also [GPU acceleration]({{%relref "docs/features/gpu-acceleration" %}}).
| Model | Category | Docker command |
| --- | --- | --- |
| [phi-2](https://huggingface.co/microsoft/phi-2) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core phi-2``` |
| [llava](https://github.com/SkunkworksAI/BakLLaVA) | [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core llava``` |
| [mistral-openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core mistral-openorca``` |
| [bert-cpp](https://github.com/skeskinen/bert.cpp) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core bert-cpp``` |
| [all-minilm-l6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 all-minilm-l6-v2``` |
| whisper-base | [Audio to Text]({{%relref "docs/features/audio-to-text" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core whisper-base``` |
| rhasspy-voice-en-us-amy | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core rhasspy-voice-en-us-amy``` |
| coqui | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 coqui``` |
| bark | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 bark``` |
| vall-e-x | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 vall-e-x``` |
| mixtral-instruct Mixtral-8x7B-Instruct-v0.1 | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core mixtral-instruct``` |
| [tinyllama-chat](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF) [original model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core tinyllama-chat``` |
| [dolphin-2.5-mixtral-8x7b](https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core dolphin-2.5-mixtral-8x7b``` |
{{% /tab %}}
{{< /tabs >}}
{{% alert icon="💡" %}}
**Tip** You can specify multiple models to start an instance with all of them loaded, for example to have both llava and phi-2 configured:
```bash
docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core llava phi-2
```
{{% /alert %}}
## Container images
LocalAI provides a variety of images to support different environments. These images are available on [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags) and [Dockerhub](https://hub.docker.com/r/localai/localai).
For GPU acceleration on Nvidia graphics cards, use the Nvidia/CUDA images; if you don't have a GPU, use the CPU images. If you have an AMD GPU or Apple Silicon, see the [build section]({{%relref "docs/getting-started/build" %}}).
{{% alert icon="💡" %}}
**Available Images Types**:
- Images ending with `-core` are smaller images without the pre-downloaded Python dependencies. Use these images if you plan to use only the `llama.cpp`, `stablediffusion-ncn`, `tinydream` or `rwkv` backends - if you are not sure which one to use, do **not** use these images.
- FFMpeg is **not** included in the default images due to [its licensing](https://www.ffmpeg.org/legal.html). If you need FFMpeg, use the images ending with `-ffmpeg`. Note that `ffmpeg` is required to use LocalAI's `audio-to-text` features.
- If you are using old or outdated CPUs and no GPUs, you might need to set the `REBUILD` environment variable to `true`, along with options to disable the flags your CPU does not support. Note, however, that inference will perform poorly and slowly. See also [flagset compatibility]({{%relref "docs/getting-started/build#cpu-flagset-compatibility" %}}).
{{% /alert %}}
{{< tabs tabTotal="3" >}}
{{% tab tabName="Vanilla / CPU Images" %}}
| Description | Quay | Dockerhub |
| --- | --- | --- |
| Latest images from the branch (development) | `quay.io/go-skynet/local-ai:master` | `localai/localai:master` |
| Latest tag | `quay.io/go-skynet/local-ai:latest` | `localai/localai:latest` |
| Versioned image | `quay.io/go-skynet/local-ai:{{< version >}}` | `localai/localai:{{< version >}}` |
| Versioned image including FFMpeg| `quay.io/go-skynet/local-ai:{{< version >}}-ffmpeg` | `localai/localai:{{< version >}}-ffmpeg` |
| Versioned image including FFMpeg, no python | `quay.io/go-skynet/local-ai:{{< version >}}-ffmpeg-core` | `localai/localai:{{< version >}}-ffmpeg-core` |
{{% /tab %}}
{{% tab tabName="GPU Images CUDA 11" %}}
| Description | Quay | Dockerhub |
| --- | --- | --- |
| Latest images from the branch (development) | `quay.io/go-skynet/local-ai:master-cublas-cuda11` | `localai/localai:master-cublas-cuda11` |
| Latest tag | `quay.io/go-skynet/local-ai:latest-cublas-cuda11` | `localai/localai:latest-cublas-cuda11` |
| Versioned image | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11` | `localai/localai:{{< version >}}-cublas-cuda11` |
| Versioned image including FFMpeg| `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11-ffmpeg` | `localai/localai:{{< version >}}-cublas-cuda11-ffmpeg` |
| Versioned image including FFMpeg, no python | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11-ffmpeg-core` | `localai/localai:{{< version >}}-cublas-cuda11-ffmpeg-core` |
{{% /tab %}}
{{% tab tabName="GPU Images CUDA 12" %}}
| Description | Quay | Dockerhub |
| --- | --- | --- |
| Latest images from the branch (development) | `quay.io/go-skynet/local-ai:master-cublas-cuda12` | `localai/localai:master-cublas-cuda12` |
| Latest tag | `quay.io/go-skynet/local-ai:latest-cublas-cuda12` | `localai/localai:latest-cublas-cuda12` |
| Versioned image | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12` | `localai/localai:{{< version >}}-cublas-cuda12` |
| Versioned image including FFMpeg| `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12-ffmpeg` | `localai/localai:{{< version >}}-cublas-cuda12-ffmpeg` |
| Versioned image including FFMpeg, no python | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12-ffmpeg-core` | `localai/localai:{{< version >}}-cublas-cuda12-ffmpeg-core` |
{{% /tab %}}
{{< /tabs >}}
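Putting the notes above together, here is a hedged sketch of how these variants are typically picked. The tags come from the tables above; the `CMAKE_ARGS` flags are an assumption based on the flagset compatibility notes and may need adjusting for your CPU:

```bash
# Smaller image without Python backends, with FFmpeg included
docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core

# Older CPU without AVX2/FMA: rebuild on start and disable unsupported flags
# (slow: the binary is recompiled inside the container before serving)
docker run -ti -p 8080:8080 \
  -e REBUILD=true \
  -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" \
  localai/localai:{{< version >}}
```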
## What's next?
Explore further resources and community contributions:
- [Community How to's](https://io.midori-ai.xyz/howtos/)
- [Examples](https://github.com/mudler/LocalAI/tree/master/examples#examples)
[![Screenshot from 2023-04-26 23-59-55](https://user-images.githubusercontent.com/2420543/234715439-98d12e03-d3ce-4f94-ab54-2b256808e05e.png)](https://github.com/mudler/LocalAI/tree/master/examples#examples)

View file

@ -0,0 +1,34 @@
+++
disableToc = false
title = "Integrations"
weight = 19
icon = "rocket_launch"
+++
## Community integrations
List of projects that use LocalAI directly behind the scenes:
- https://github.com/sozercan/aikit
- https://github.com/aorumbayev/autogpt4all
- https://github.com/mudler/LocalAGI
## The following software has out-of-the-box integrations with LocalAI
LocalAI can be used as a drop-in replacement in most OpenAI clients; however, the following projects provide specific integrations with LocalAI (a generic configuration sketch follows the list below):
- [AnythingLLM](https://github.com/Mintplex-Labs/anything-llm)
- [Logseq GPT3 OpenAI plugin](https://github.com/briansunter/logseq-plugin-gpt3-openai) allows to set a base URL, and works with LocalAI.
- https://github.com/longy2k/obsidian-bmo-chatbot
- https://github.com/FlowiseAI/Flowise
- https://github.com/k8sgpt-ai/k8sgpt
- https://github.com/kairos-io/kairos
- https://github.com/langchain4j/langchain4j
- https://github.com/henomis/lingoose
- https://github.com/trypromptly/LLMStack
- https://github.com/mattermost/openops
- https://github.com/charmbracelet/mods
- https://github.com/cedriking/spark
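For clients not listed here, a rough, hedged sketch of the drop-in approach: many OpenAI-compatible clients only need the base URL swapped to your LocalAI instance. The exact setting name varies per project; `OPENAI_API_BASE` is a common convention, and the key below is just a placeholder:

```bash
# Point an OpenAI-compatible client at a local LocalAI instance
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local-placeholder   # only needed if the client insists on a key
```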
Feel free to open an [issue](https://github.com/go-skynet/localai-website/issues) to get a page made for your project, or if you spot an error on one of the pages!

View file

@ -0,0 +1,139 @@
+++
title = "Overview"
weight = 1
toc = true
description = "What is LocalAI?"
tags = ["Beginners"]
categories = [""]
author = "Ettore Di Giacinto"
# This allows to overwrite the landing page
url = '/'
icon = "info"
+++
<p align="center">
<a href="https://localai.io"><img width=512 src="https://github.com/go-skynet/LocalAI/assets/2420543/0966aa2a-166e-4f99-a3e5-6c915fc997dd"></a>
</p>
<p align="center">
<a href="https://github.com/go-skynet/LocalAI/fork" target="blank">
<img src="https://img.shields.io/github/forks/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI forks"/>
</a>
<a href="https://github.com/go-skynet/LocalAI/stargazers" target="blank">
<img src="https://img.shields.io/github/stars/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI stars"/>
</a>
<a href="https://github.com/go-skynet/LocalAI/pulls" target="blank">
<img src="https://img.shields.io/github/issues-pr/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI pull-requests"/>
</a>
<a href='https://github.com/go-skynet/LocalAI/releases'>
<img src='https://img.shields.io/github/release/go-skynet/LocalAI?&label=Latest&style=for-the-badge'>
</a>
</p>
[<img src="https://img.shields.io/badge/dockerhub-images-important.svg?logo=Docker">](https://hub.docker.com/r/localai/localai)
[<img src="https://img.shields.io/badge/quay.io-images-important.svg?">](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest)
> 💡 Get help - [❓FAQ](https://localai.io/faq/) [❓How tos](https://io.midori-ai.xyz/howtos/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [💭Discord](https://discord.gg/uJAeKSAGDy)
>
> [💻 Quickstart](https://localai.io/basics/getting_started/) [📣 News](https://localai.io/basics/news/) [ 🛫 Examples ](https://github.com/go-skynet/LocalAI/tree/master/examples/) [ 🖼️ Models ](https://localai.io/models/) [ 🚀 Roadmap ](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)
**LocalAI** is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that's compatible with OpenAI API specifications for local inferencing. It allows you to run LLMs and generate images, audio (and more) locally or on-prem with consumer grade hardware, supporting multiple model families and architectures. It does not require a GPU. It is maintained by [mudler](https://github.com/mudler).
<p align="center">
<a href="https://twitter.com/LocalAI_API" target="blank">
<img src="https://img.shields.io/twitter/follow/LocalAI_API?label=Follow: LocalAI_API&style=social" alt="Follow LocalAI_API"/>
</a>
<a href="https://discord.gg/uJAeKSAGDy" target="blank">
<img src="https://dcbadge.vercel.app/api/server/uJAeKSAGDy?style=flat-square&theme=default-inverted" alt="Join LocalAI Discord Community"/>
</a>
</p>
In a nutshell:
- Local, OpenAI drop-in alternative REST API. You own your data.
- NO GPU required. NO Internet access is required either
- Optional, GPU Acceleration is available. See also the [build section](https://localai.io/basics/build/index.html).
- Supports multiple models
- 🏃 Once loaded the first time, it keeps models loaded in memory for faster inference
- ⚡ Doesn't shell out, but uses bindings for faster inference and better performance.
LocalAI is focused on making AI accessible to anyone. Any contribution, feedback and PR is welcome!
Note that this started just as a fun weekend project by [mudler](https://github.com/mudler) in order to try to create the necessary pieces for a full AI assistant like `ChatGPT`: the community is growing fast and we are working hard to make it better and more stable. If you want to help, please consider contributing (see below)!
## 🚀 Features
- 📖 [Text generation with GPTs](https://localai.io/features/text-generation/) (`llama.cpp`, `gpt4all.cpp`, ... [:book: and more](https://localai.io/model-compatibility/index.html#model-compatibility-table))
- 🗣 [Text to Audio](https://localai.io/features/text-to-audio/)
- 🔈 [Audio to Text](https://localai.io/features/audio-to-text/) (Audio transcription with `whisper.cpp`)
- 🎨 [Image generation with stable diffusion](https://localai.io/features/image-generation)
- 🔥 [OpenAI functions](https://localai.io/features/openai-functions/) 🆕
- 🧠 [Embeddings generation for vector databases](https://localai.io/features/embeddings/)
- ✍️ [Constrained grammars](https://localai.io/features/constrained_grammars/)
- 🖼️ [Download Models directly from Huggingface ](https://localai.io/models/)
- 🆕 [Vision API](https://localai.io/features/gpt-vision/)
## How does it work?
LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including [ggml](https://github.com/ggerganov/ggml), to perform inference on LLMs using the CPU and, if desired, the GPU. Internally, LocalAI backends are just gRPC servers; you can specify and build your own gRPC server and extend LocalAI at runtime as well. It is also possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.
LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, ...). You can check [the model compatibility table]({{%relref "docs/reference/compatibility-table" %}}) to learn about all the components of LocalAI.
![localai](https://github.com/go-skynet/localai-website/assets/2420543/6492e685-8282-4217-9daa-e229a31548bc)
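As a quick sketch of the shim in action, the usual OpenAI-style request works against a running LocalAI instance; the model name here is just an example and depends on what you have configured locally:

```bash
# Standard OpenAI-compatible chat completion request against LocalAI
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "How are you?"}]
  }'
```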
## Contribute and help
To help the project you can:
- If you have technological skills and want to contribute to development, have a look at the open issues. If you are new you can have a look at the [good-first-issue](https://github.com/go-skynet/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and [help-wanted](https://github.com/go-skynet/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels.
- If you don't have technological skills you can still help by improving documentation, [adding examples](https://github.com/go-skynet/LocalAI/tree/master/examples) or sharing your user stories with our community; any help and contribution is welcome!
## 🌟 Star history
[![LocalAI Star history Chart](https://api.star-history.com/svg?repos=go-skynet/LocalAI&type=Date)](https://star-history.com/#go-skynet/LocalAI&Date)
## 📖 License
LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/).
MIT - Author Ettore Di Giacinto
## 🙇 Acknowledgements
LocalAI couldn't have been built without the help of great software already available from the community. Thank you!
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/cornelk/llama-go for the initial ideas
- https://github.com/antimatter15/alpaca.cpp
- https://github.com/EdVince/Stable-Diffusion-NCNN
- https://github.com/ggerganov/whisper.cpp
- https://github.com/saharNooby/rwkv.cpp
- https://github.com/rhasspy/piper
- https://github.com/cmp-nct/ggllm.cpp
## Backstory
As with many typical open source projects, I, [mudler](https://github.com/mudler/), was fiddling around with [llama.cpp](https://github.com/ggerganov/llama.cpp) over my long nights and wanted a way to call it from `go`, as I am a Golang developer and use it extensively. So I created `LocalAI` (or what was initially known as `llama-cli`) and added an API to it.
But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.
Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid any penalties from Go's garbage collector, and, most importantly, build on the shoulders of giants like `llama.cpp`. Go is good at backends and APIs and is easy to maintain. And hey, don't forget that I'm all about sharing the love. That's why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.
As if that wasn't exciting enough, as the project gained traction, [mkellerman](https://github.com/mkellerman) and [Aisuko](https://github.com/Aisuko) jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn't be happier about it!
Oh, and let's not forget the real MVP here—[llama.cpp](https://github.com/ggerganov/llama.cpp). Without this extraordinary piece of software, LocalAI wouldn't even exist. So, a big shoutout to the community for making this magic happen!
## 🤗 Contributors
This is a community project, a special thanks to our contributors! 🤗
<a href="https://github.com/go-skynet/LocalAI/graphs/contributors">
<img src="https://contrib.rocks/image?repo=go-skynet/LocalAI" />
</a>
<a href="https://github.com/go-skynet/LocalAI-website/graphs/contributors">
<img src="https://contrib.rocks/image?repo=go-skynet/LocalAI-website" />
</a>

View file

@ -0,0 +1,11 @@
---
weight: 23
title: "References"
description: "Reference"
icon: science
lead: ""
date: 2020-10-06T08:49:15+00:00
lastmod: 2020-10-06T08:49:15+00:00
draft: false
images: []
---

View file

@ -0,0 +1,55 @@
+++
disableToc = false
title = "Model compatibility table"
weight = 24
+++
Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all the compatible model families and the associated binding repositories.
{{% alert note %}}
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See [the advanced section]({{%relref "docs/advanced" %}}) for more details.
{{% /alert %}}
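For instance, a minimal sketch of a model configuration that pins a backend explicitly could look like the following; the file and model names are placeholders:

```bash
# Write a minimal model configuration that selects the backend explicitly
cat > models/my-model.yaml <<EOF
name: my-model
backend: llama
parameters:
  model: my-model.gguf
EOF
```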
| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|----------------------------------------------------------------------------------|-----------------------|--------------------------|---------------------------|-----------------------------------|----------------------|--------------|
| [llama.cpp]({{%relref "docs/features/text-generation#llama.cpp" %}}) | Vicuna, Alpaca, LLaMa | yes | GPT and Functions | yes** | yes | CUDA, openCL, cuBLAS, Metal |
| [gpt4all-llama](https://github.com/nomic-ai/gpt4all) | Vicuna, Alpaca, LLaMa | yes | GPT | no | yes | N/A |
| [gpt4all-mpt](https://github.com/nomic-ai/gpt4all) | MPT | yes | GPT | no | yes | N/A |
| [gpt4all-j](https://github.com/nomic-ai/gpt4all) | GPT4ALL-J | yes | GPT | no | yes | N/A |
| [falcon-ggml](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | Falcon (*) | yes | GPT | no | no | N/A |
| [gpt2](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | GPT2, Cerebras | yes | GPT | no | no | N/A |
| [dolly](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | Dolly | yes | GPT | no | no | N/A |
| [gptj](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | GPTJ | yes | GPT | no | no | N/A |
| [mpt](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | MPT | yes | GPT | no | no | N/A |
| [replit](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | Replit | yes | GPT | no | no | N/A |
| [gptneox](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | GPT NeoX, RedPajama, StableLM | yes | GPT | no | no | N/A |
| [starcoder](https://github.com/ggerganov/ggml) ([binding](https://github.com/go-skynet/go-ggml-transformers.cpp)) | Starcoder | yes | GPT | no | no | N/A|
| [bloomz](https://github.com/NouamaneTazi/bloomz.cpp) ([binding](https://github.com/go-skynet/bloomz.cpp)) | Bloom | yes | GPT | no | no | N/A |
| [rwkv](https://github.com/saharNooby/rwkv.cpp) ([binding](https://github.com/donomii/go-rwkv.cpp)) | rwkv | yes | GPT | no | yes | N/A |
| [bert](https://github.com/skeskinen/bert.cpp) ([binding](https://github.com/go-skynet/go-bert.cpp)) | bert | no | Embeddings only | yes | no | N/A |
| [whisper](https://github.com/ggerganov/whisper.cpp) | whisper | no | Audio | no | no | N/A |
| [stablediffusion](https://github.com/EdVince/Stable-Diffusion-NCNN) ([binding](https://github.com/mudler/go-stable-diffusion)) | stablediffusion | no | Image | no | no | N/A |
| [langchain-huggingface](https://github.com/tmc/langchaingo) | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |
| [piper](https://github.com/rhasspy/piper) ([binding](https://github.com/mudler/go-piper)) | Any piper onnx model | no | Text to voice | no | no | N/A |
| [falcon](https://github.com/cmp-nct/ggllm.cpp/tree/c12b2d65f732a0d8846db2244e070f0f3e73505c) ([binding](https://github.com/mudler/go-ggllm.cpp)) | Falcon *** | yes | GPT | no | yes | CUDA |
| [sentencetransformers](https://github.com/UKPLab/sentence-transformers) | BERT | no | Embeddings only | yes | no | N/A |
| `bark` | bark | no | Audio generation | no | no | yes |
| `autogptq` | GPTQ | yes | GPT | yes | no | N/A |
| `exllama` | GPTQ | yes | GPT only | no | no | N/A |
| `diffusers` | SD,... | no | Image generation | no | no | N/A |
| `vall-e-x` | Vall-E | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| `vllm` | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
| `exllama2` | GPTQ | yes | GPT only | no | no | N/A |
| `transformers-musicgen` | | no | Audio generation | no | no | N/A |
| [tinydream](https://github.com/symisc/tiny-dream#tiny-dreaman-embedded-header-only-stable-diffusion-inference-c-librarypixlabiotiny-dream) | stablediffusion | no | Image | no | no | N/A |
| `coqui` | Coqui | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| `petals` | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
Note: any backend name listed above can be used in the `backend` field of the model configuration file (See [the advanced section]({{%relref "docs/advanced" %}})).
- \* 7b ONLY
- ** doesn't seem to be accurate
- *** 7b and 40b with the `ggccv` format, for instance: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML

View file

@ -0,0 +1,458 @@
+++
disableToc = false
title = "News"
weight = 7
url = '/basics/news/'
icon = "newspaper"
+++
Release notes have now been moved completely over to GitHub releases.
You can see the release notes [here](https://github.com/mudler/LocalAI/releases).
# Older release notes
## 04-12-2023: __v2.0.0__
This release brings a major overhaul in some backends.
Breaking/important changes:
- Backend rename: `llama-stable` renamed to `llama-ggml` {{< pr "1287" >}}
- Prompt template changes: {{< pr "1254" >}} (extra space in roles)
- Apple metal bugfixes: {{< pr "1365" >}}
New:
- Added support for LLaVa and OpenAI Vision API support ({{< pr "1254" >}})
- Python based backends are now using conda to track env dependencies ( {{< pr "1144" >}} )
- Support for parallel requests ( {{< pr "1290" >}} )
- Support for transformers-embeddings ( {{< pr "1308" >}})
- Watchdog for backends ( {{< pr "1341" >}}). As https://github.com/ggerganov/llama.cpp/issues/3969 is hitting LocalAI's llama-cpp implementation, we have now a watchdog that can be used to make sure backends are not stalling. This is a generic mechanism that can be enabled for all the backends now.
- Whisper.cpp updates ( {{< pr "1302" >}} )
- Petals backend ( {{< pr "1350" >}} )
- Full LLM fine-tuning example to use with LocalAI: https://localai.io/advanced/fine-tuning/
Due to the Python dependencies, the images have grown in size.
If you still want to use smaller images without Python dependencies, you can use the corresponding image tags ending with `-core`.
Full changelog: https://github.com/mudler/LocalAI/releases/tag/v2.0.0
## 30-10-2023: __v1.40.0__
This release is a preparation before v2 - the efforts now will be to refactor, polish and add new backends. Follow up on: https://github.com/mudler/LocalAI/issues/1126
## Hot topics
This release now brings the `llama-cpp` backend, which is a C++ backend tied to llama.cpp. It follows and tracks recent versions of llama.cpp more closely. It is not feature-compatible with the current `llama` backend, but the plan is to sunset the current `llama` backend in favor of this one. This will probably be the last release containing the older `llama` backend written in Go and C++. The major improvement with this change is that there are fewer layers that could expose potential bugs, and it also eases maintenance.
### Support for ROCm/HIPBLAS
This release brings support for AMD thanks to @65a. See more details in {{< pr "1100" >}}
### More CLI commands
Thanks to @jespino, the local-ai binary now has more subcommands, allowing you to manage the gallery or try out inferencing directly - check it out!
[Release notes](https://github.com/mudler/LocalAI/releases/tag/v1.40.0)
## 25-09-2023: __v1.30.0__
This is an exciting LocalAI release! Besides bug fixes and enhancements, this release takes backend support to a whole new level by extending it to vllm and to vall-e-x for audio generation!
Check out the documentation for vllm [here](https://localai.io/model-compatibility/vllm/) and Vall-E-X [here](https://localai.io/model-compatibility/vall-e-x/)
[Release notes](https://github.com/mudler/LocalAI/releases/tag/v1.30.0)
## 26-08-2023: __v1.25.0__
Hey everyone, [Ettore](https://github.com/mudler/) here. I'm so happy to share this release - while this summer is hot, it apparently doesn't stop LocalAI development :)
This release brings a lot of new features, bugfixes and updates! Also a big shout out to the community, this was a great release!
### Attention 🚨
From this release the `llama` backend supports only `gguf` files (see {{< pr "943" >}}). LocalAI however still supports `ggml` files. We ship a version of llama.cpp from before that change in a separate backend, named `llama-stable`, so that `ggml` files can still be loaded. If you were specifying the `llama` backend manually to load `ggml` files, from this release you should use `llama-stable` instead, or not specify a backend at all (LocalAI will handle this automatically).
### Image generation enhancements
The [Diffusers]({{%relref "docs/features/image-generation" %}}) backend got various enhancements, including support for generating images from images, longer prompts, and more kernel schedulers. See the [Diffusers]({{%relref "docs/features/image-generation" %}}) documentation for more information.
### Lora adapters
Now it's possible to load lora adapters for llama.cpp. See {{< pr "955" >}} for more information.
### Device management
It is now possible, on single-GPU devices, to specify `--single-active-backend` to allow only one backend to be active at a time {{< pr "925" >}}.
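A minimal sketch of how that flag might be passed when starting the binary directly:

```bash
# Keep only one backend loaded at a time (useful on single-GPU machines)
./local-ai --single-active-backend
```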
### Community spotlight
#### Resources management
Thanks to the continuous community efforts (another cool contribution from {{< github "dave-gray101" >}}), it's now possible to shut down a backend programmatically via the API.
There is an ongoing effort in the community to better handle resources. See also the [🔥Roadmap](https://localai.io/#-hot-topics--roadmap).
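A hedged sketch of what such a programmatic shutdown could look like; the endpoint path and payload here are assumptions based on the linked work, so check the API reference for the exact shape:

```bash
# Hypothetical sketch: ask LocalAI to free the backend serving a given model
curl http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo"}'
```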
#### New how-to section
Thanks to the community's efforts, we now have a new [how-to website](https://io.midori-ai.xyz/howtos/) with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it; a huge shout out to {{< github "lunamidori5" >}} from the community for the impressive efforts on this!
#### 💡 More examples!
- Open source autopilot? See the new addition by {{< github "gruberdev" >}} in our [examples](https://github.com/go-skynet/LocalAI/tree/master/examples/continue) on how to use Continue with LocalAI!
- Want to try LocalAI with Insomnia? Check out the new [Insomnia example](https://github.com/go-skynet/LocalAI/tree/master/examples/insomnia) by {{< github "dave-gray101" >}}!
#### LocalAGI in discord!
Did you know that we now have a few cool bots in our Discord? Come check them out! We also have an instance of [LocalAGI](https://github.com/mudler/LocalAGI) ready to help you out!
### Changelog summary
#### Breaking Changes 🛠
* feat: bump llama.cpp, add gguf support by {{< github "mudler" >}} in {{< pr "943" >}}
#### Exciting New Features 🎉
* feat(Makefile): allow to restrict backend builds by {{< github "mudler" >}} in {{< pr "890" >}}
* feat(diffusers): various enhancements by {{< github "mudler" >}} in {{< pr "895" >}}
* feat: make initializer accept gRPC delay times by {{< github "mudler" >}} in {{< pr "900" >}}
* feat(diffusers): add DPMSolverMultistepScheduler++, DPMSolverMultistepSchedulerSDE++, guidance_scale by {{< github "mudler" >}} in {{< pr "903" >}}
* feat(diffusers): overcome prompt limit by {{< github "mudler" >}} in {{< pr "904" >}}
* feat(diffusers): add img2img and clip_skip, support more kernels schedulers by {{< github "mudler" >}} in {{< pr "906" >}}
* Usage Features by {{< github "dave-gray101" >}} in {{< pr "863" >}}
* feat(diffusers): be consistent with pipelines, support also depthimg2img by {{< github "mudler" >}} in {{< pr "926" >}}
* feat: add --single-active-backend to allow only one backend active at the time by {{< github "mudler" >}} in {{< pr "925" >}}
* feat: add llama-stable backend by {{< github "mudler" >}} in {{< pr "932" >}}
* feat: allow to customize rwkv tokenizer by {{< github "dave-gray101" >}} in {{< pr "937" >}}
* feat: backend monitor shutdown endpoint, process based by {{< github "dave-gray101" >}} in {{< pr "938" >}}
* feat: Allow to load lora adapters for llama.cpp by {{< github "mudler" >}} in {{< pr "955" >}}
Join our Discord community! Our vibrant community is growing fast, and we are always happy to help! https://discord.gg/uJAeKSAGDy
The full changelog is available [here](https://github.com/go-skynet/LocalAI/releases/tag/v.1.25.0).
---
## 🔥🔥🔥🔥 12-08-2023: __v1.24.0__ 🔥🔥🔥🔥
This release brings four(!) new backends to LocalAI: [🐶 Bark]({{%relref "docs/features/text-to-audio#bark" %}}), 🦙 [AutoGPTQ]({{%relref "docs/features/text-generation#autogptq" %}}), [🧨 Diffusers]({{%relref "docs/features/image-generation" %}}) and 🦙 [exllama]({{%relref "docs/features/text-generation#exllama" %}}), along with a lot of improvements!
### Major improvements:
* feat: add bark and AutoGPTQ by {{< github "mudler" >}} in {{< pr "871" >}}
* feat: Add Diffusers by {{< github "mudler" >}} in {{< pr "874" >}}
* feat: add API_KEY list support by {{< github "neboman11" >}} and {{< github "bnusunny" >}} in {{< pr "877" >}}
* feat: Add exllama by {{< github "mudler" >}} in {{< pr "881" >}}
* feat: pre-configure LocalAI galleries by {{< github "mudler" >}} in {{< pr "886" >}}
### 🐶 Bark
[Bark]({{%relref "docs/features/text-to-audio#bark" %}}) is a text-prompted generative audio model - it combines GPT techniques to generate Audio from text. It is a great addition to LocalAI, and it's available in the container images by default.
It can also generate music, see the example: [lion.webm](https://user-images.githubusercontent.com/5068315/230684766-97f5ea23-ad99-473c-924b-66b6fab24289.webm)
### 🦙 AutoGPTQ
[AutoGPTQ]({{%relref "docs/features/text-generation#autogptq" %}}) is an easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
It is mainly targeted at GPU usage. Check out the [documentation]({{%relref "docs/features/text-generation" %}}) for usage.
### 🦙 Exllama
[Exllama]({{%relref "docs/features/text-generation#exllama" %}}) is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". It is a faster alternative for running LLaMA models on GPU. Check out the [Exllama documentation]({{%relref "docs/features/text-generation#exllama" %}}) for usage.
### 🧨 Diffusers
[Diffusers]({{%relref "docs/features/image-generation#diffusers" %}}) is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Currently it is experimental and supports only image generation, so you might encounter issues with models that haven't been tested yet. Check out the [Diffusers documentation]({{%relref "docs/features/image-generation" %}}) for usage.
### 🔑 API Keys
Thanks to community contributions, it's now possible to specify a list of API keys that can be used to gate API requests.
API Keys can be specified with the `API_KEY` environment variable as a comma-separated list of keys.
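A short sketch of how that might look in practice; the keys are placeholders, and the `Authorization: Bearer` header follows the usual OpenAI convention:

```bash
# Start LocalAI gated by two API keys...
API_KEY="key-one,key-two" ./local-ai

# ...and call it using one of them
curl http://localhost:8080/v1/models -H "Authorization: Bearer key-one"
```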
### 🖼️ Galleries
The model-gallery repositories are now configured by default in the container images.
### 💡 New project
[LocalAGI](https://github.com/mudler/LocalAGI) is a simple agent that uses LocalAI functions to have a full locally runnable assistant (with no API keys needed).
See it [here in action](https://github.com/mudler/LocalAGI/assets/2420543/9ba43b82-dec5-432a-bdb9-8318e7db59a4) planning a trip for San Francisco!
The full changelog is available [here](https://github.com/go-skynet/LocalAI/releases/tag/v.1.24.0).
---
## 🔥🔥 29-07-2023: __v1.23.0__ 🚀
This release focuses mostly on bugfixing and updates, with just a couple of new features:
* feat: add rope settings and negative prompt, drop grammar backend by {{< github "mudler" >}} in {{< pr "797" >}}
* Added CPU information to entrypoint.sh by @finger42 in {{< pr "794" >}}
* feat: cancel stream generation if client disappears by @tmm1 in {{< pr "792" >}}
Most notably, this release brings important fixes for CUDA (and not only):
* fix: add rope settings during model load, fix CUDA by {{< github "mudler" >}} in {{< pr "821" >}}
* fix: select function calls if 'name' is set in the request by {{< github "mudler" >}} in {{< pr "827" >}}
* fix: symlink libphonemize in the container by {{< github "mudler" >}} in {{< pr "831" >}}
{{% alert note %}}
From this release [OpenAI functions]({{%relref "docs/features/openai-functions" %}}) are available in the `llama` backend. The `llama-grammar` has been deprecated. See also [OpenAI functions]({{%relref "docs/features/openai-functions" %}}).
{{% /alert %}}
The full [changelog is available here](https://github.com/go-skynet/LocalAI/releases/tag/v1.23.0)
---
## 🔥🔥🔥 23-07-2023: __v1.22.0__ 🚀
* feat: add llama-master backend by {{< github "mudler" >}} in {{< pr "752" >}}
* [build] pass build type to cmake on libtransformers.a build by @TonDar0n in {{< pr "741" >}}
* feat: resolve JSONSchema refs (planners) by {{< github "mudler" >}} in {{< pr "774" >}}
* feat: backends improvements by {{< github "mudler" >}} in {{< pr "778" >}}
* feat(llama2): add template for chat messages by {{< github "dave-gray101" >}} in {{< pr "782" >}}
{{% alert note %}}
From this release, to use the OpenAI functions you need to use the `llama-grammar` backend. A `llama` backend has been added to track `llama.cpp` master, and a `llama-grammar` backend for the grammar functionality that has not yet been merged upstream. See also [OpenAI functions]({{%relref "docs/features/openai-functions" %}}). Until the feature is merged we will have two llama backends.
{{% /alert %}}
## Huggingface embeddings
In this release it is now possible to specify external `gRPC` backends that LocalAI can use for inferencing {{< pr "778" >}}. It is now possible to write internal backends in any language, and a `huggingface-embeddings` backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also [Embeddings]({{%relref "docs/features/embeddings" %}}).
## LLaMa 2 has been released!
Thanks to the community's efforts, LocalAI now supports templating for LLaMa 2! More at {{< pr "782" >}} until we update the model gallery with LLaMa 2 models!
## Official langchain integration
Progress has been made to support LocalAI with `langchain`. See: https://github.com/langchain-ai/langchain/pull/8134
---
## 🔥🔥🔥 17-07-2023: __v1.21.0__ 🚀
* [whisper] Partial support for verbose_json format in transcribe endpoint by `@ldotlopez` in {{< pr "721" >}}
* LocalAI functions by `@mudler` in {{< pr "726" >}}
* `gRPC`-based backends by `@mudler` in {{< pr "743" >}}
* falcon support (7b and 40b) with `ggllm.cpp` by `@mudler` in {{< pr "743" >}}
### LocalAI functions
This allows running OpenAI functions as described in the OpenAI blog post and documentation: https://openai.com/blog/function-calling-and-other-api-updates.
This is a video of running the same example, locally with `LocalAI`:
![localai-functions-1](https://github.com/ggerganov/llama.cpp/assets/2420543/5bd15da2-78c1-4625-be90-1e938e6823f1)
And here when it actually picks to reply to the user instead of using functions!
![functions-2](https://github.com/ggerganov/llama.cpp/assets/2420543/e3f89d15-1d2c-45ab-974f-6c9eb8eae41d)
Note: functions are supported only with `llama.cpp`-compatible models.
A full example is available here: https://github.com/go-skynet/LocalAI/tree/master/examples/functions
### gRPC backends
This is an internal refactor which is not user-facing; however, it eases maintenance and the addition of new backends to LocalAI!
### `falcon` support
Now Falcon 7b and 40b models compatible with https://github.com/cmp-nct/ggllm.cpp are supported as well.
The former, ggml-based backend has been renamed to `falcon-ggml`.
### Default pre-compiled binaries
From this release the default behavior of the images has changed. Compilation is no longer triggered automatically on start; to recompile `local-ai` from scratch on start and switch back to the old behavior, set `REBUILD=true` in the environment. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the [build section]({{%relref "docs/getting-started/build" %}}) for more information.
[Full release changelog](https://github.com/go-skynet/LocalAI/releases/tag/v1.21.0)
---
## 🔥🔥🔥 28-06-2023: __v1.20.0__ 🚀
### Exciting New Features 🎉
* Add Text-to-Audio generation with `go-piper` by {{< github "mudler" >}} in {{< pr "649" >}} See [API endpoints]({{%relref "docs/features/text-to-audio" %}}) in our documentation.
* Add gallery repository by {{< github "mudler" >}} in {{< pr "663" >}}. See [models]({{%relref "docs/features/model-gallery" %}}) for documentation.
### Container images
- Standard (GPT + `stablediffusion`): `quay.io/go-skynet/local-ai:v1.20.0`
- FFmpeg: `quay.io/go-skynet/local-ai:v1.20.0-ffmpeg`
- CUDA 11+FFmpeg: `quay.io/go-skynet/local-ai:v1.20.0-cublas-cuda11-ffmpeg`
- CUDA 12+FFmpeg: `quay.io/go-skynet/local-ai:v1.20.0-cublas-cuda12-ffmpeg`
### Updates
Updates to `llama.cpp`, `go-transformers`, `gpt4all.cpp` and `rwkv.cpp`.
The NUMA option was enabled by {{< github "mudler" >}} in {{< pr "684" >}}, along with many new parameters (`mmap`,`mmlock`, ..). See [advanced]({{%relref "docs/advanced" %}}) for the full list of parameters.
### Gallery repositories
In this release there is support for gallery repositories. These are repositories that contain models, and can be used to install models. The default gallery which contains only freely licensed models is in Github: https://github.com/go-skynet/model-gallery, but you can use your own gallery by setting the `GALLERIES` environment variable. An automatic index of huggingface models is available as well.
For example, now you can start `LocalAI` with the following environment variable to use both galleries:
```bash
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:ci-robbot/localai-huggingface-zoo/index.yaml","name":"huggingface"}]
```
And at runtime you can now install a model from Hugging Face with:
```bash
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "huggingface@thebloke__open-llama-7b-open-instruct-ggml__open-llama-7b-open-instruct.ggmlv3.q4_0.bin" }'
```
or a `tts` voice with:
```bash
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "model-gallery@voice-en-us-kathleen-low" }'
```
See also [models]({{%relref "docs/features/model-gallery" %}}) for a complete documentation.
### Text to Audio
Now `LocalAI` uses [piper](https://github.com/rhasspy/piper) and [go-piper](https://github.com/mudler/go-piper) to generate audio from text. This is an experimental feature, and it requires `GO_TAGS=tts` to be set during build. It is enabled by default in the pre-built container images.
To set up audio models, you can use the new galleries, or set up the models manually as described in [the API section of the documentation]({{%relref "docs/features/text-to-audio" %}}).
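A hedged sketch of a source build with the TTS backend enabled (not needed for the pre-built container images, where it is already on):

```bash
# Build local-ai from source with the go-piper TTS backend compiled in
make GO_TAGS=tts build
```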
You can check the full changelog in [Github](https://github.com/go-skynet/LocalAI/releases/tag/v1.20.0)
---
## 🔥🔥🔥 19-06-2023: __v1.19.0__ 🚀
- Full CUDA GPU offload support ( [PR](https://github.com/go-skynet/go-llama.cpp/pull/105) by [mudler](https://github.com/mudler). Thanks to [chnyda](https://github.com/chnyda) for handing over the GPU access, and [lu-zero](https://github.com/lu-zero) to help in debugging )
- Full GPU Metal Support is now fully functional. Thanks to [Soleblaze](https://github.com/Soleblaze) to iron out the Metal Apple silicon support!
Container images:
- Standard (GPT + `stablediffusion`): `quay.io/go-skynet/local-ai:v1.19.2`
- FFmpeg: `quay.io/go-skynet/local-ai:v1.19.2-ffmpeg`
- CUDA 11+FFmpeg: `quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda11-ffmpeg`
- CUDA 12+FFmpeg: `quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda12-ffmpeg`
---
## 🔥🔥🔥 06-06-2023: __v1.18.0__ 🚀
This LocalAI release is full of new features, bugfixes and updates! Thanks to the community for the help - this was a great community release!
We now support a vast variety of models while remaining backward compatible with prior quantization formats: this new release can still load older formats as well as the new [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)!
### New features
- ✨ Added support for `falcon`-based model families (7b) ( [mudler](https://github.com/mudler) )
- ✨ Experimental support for Metal Apple Silicon GPU - ( [mudler](https://github.com/mudler) and thanks to [Soleblaze](https://github.com/Soleblaze) for testing! ). See the [build section]({{%relref "docs/getting-started/build#Acceleration" %}}).
- ✨ Support for token stream in the `/v1/completions` endpoint ( [samm81](https://github.com/samm81) )
- ✨ Added huggingface backend ( [Evilfreelancer](https://github.com/EvilFreelancer) )
- 📷 Stablediffusion can now output `2048x2048` images with `esrgan`! ( [mudler](https://github.com/mudler) )
### Container images
- 🐋 CUDA container images (arm64, x86_64) ( [sebastien-prudhomme](https://github.com/sebastien-prudhomme) )
- 🐋 FFmpeg container images (arm64, x86_64) ( [mudler](https://github.com/mudler) )
### Dependencies updates
- 🆙 Bloomz has been updated to the latest ggml changes, including new quantization format ( [mudler](https://github.com/mudler) )
- 🆙 RWKV has been updated to the new quantization format( [mudler](https://github.com/mudler) )
- 🆙 [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684) format support for the `llama` models ( [mudler](https://github.com/mudler) )
- 🆙 gpt4all has been updated, incorporating upstream changes that allow loading older models, and supporting different CPU instruction sets (AVX only, AVX2) from the same binary! ( [mudler](https://github.com/mudler) )
### Generic
- 🐧 Fully Linux static binary releases ( [mudler](https://github.com/mudler) )
- 📷 Stablediffusion has been enabled on container images by default ( [mudler](https://github.com/mudler) )
Note: You can disable container image rebuilds with `REBUILD=false`
### Examples
- 💡 [AutoGPT](https://github.com/go-skynet/LocalAI/tree/master/examples/autoGPT) example ( [mudler](https://github.com/mudler) )
- 💡 [PrivateGPT](https://github.com/go-skynet/LocalAI/tree/master/examples/privateGPT) example ( [mudler](https://github.com/mudler) )
- 💡 [Flowise](https://github.com/go-skynet/LocalAI/tree/master/examples/flowise) example ( [mudler](https://github.com/mudler) )
Two new projects offer now direct integration with LocalAI!
- [Flowise](https://github.com/FlowiseAI/Flowise/pull/123)
- [Mods](https://github.com/charmbracelet/mods)
[Full release changelog](https://github.com/go-skynet/LocalAI/releases/tag/v1.18.0)
---
## 29-05-2023: __v1.17.0__
Support for OpenCL has been added when building from source.
You can now build LocalAI from source with `BUILD_TYPE=clblas` to have an OpenCL build. See also the [build section]({{%relref "docs/getting-started/build#Acceleration" %}}).
For instructions on how to install OpenCL/CLBlast see [here](https://github.com/ggerganov/llama.cpp#blas-build).
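A sketch of such a build from source, assuming OpenCL and CLBlast are already installed:

```bash
# Clone and build LocalAI with OpenCL/CLBlast acceleration
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make BUILD_TYPE=clblas build
```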
rwkv.cpp has been updated to the new ggml format [commit](https://github.com/saharNooby/rwkv.cpp/commit/dea929f8cad90b7cf2f820c5a3d6653cfdd58c4e).
---
## 27-05-2023: __v1.16.0__
Now it's possible to automatically download pre-configured models before starting the API.
Start local-ai with the `PRELOAD_MODELS` environment variable containing a list of models from the gallery; for instance, to install `gpt4all-j` as `gpt-3.5-turbo`:
```bash
PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]
```
`llama.cpp` models can now also automatically save the prompt cache state, by specifying it in the model YAML configuration file:
```yaml
# Enable prompt caching
# This is a file that will be used to save/load the cache. relative to the models directory.
prompt_cache_path: "alpaca-cache"
# Always enable prompt cache
prompt_cache_all: true
```
See also the [advanced section]({{%relref "docs/advanced" %}}).
## Media, Blogs, Social
- [Create a slackbot for teams and OSS projects that answer to documentation](https://mudler.pm/posts/smart-slackbot-for-teams/)
- [LocalAI meets k8sgpt](https://www.youtube.com/watch?v=PKrDNuJ_dfE) - CNCF Webinar showcasing LocalAI and k8sgpt.
- [Question Answering on Documents locally with LangChain, LocalAI, Chroma, and GPT4All](https://mudler.pm/posts/localai-question-answering/) by Ettore Di Giacinto
- [Tutorial to use k8sgpt with LocalAI](https://medium.com/@tyler_97636/k8sgpt-localai-unlock-kubernetes-superpowers-for-free-584790de9b65) - an excellent use case for LocalAI, using AI to analyse Kubernetes clusters, by Tyler Gillson
## Previous
- 23-05-2023: __v1.15.0__ released. The `go-gpt2.cpp` backend was renamed to `go-ggml-transformers.cpp` and updated, including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT (not `gpt4all-mpt`), Dolly, GPT2 and Starcoder based models. [Binary releases available](https://github.com/go-skynet/LocalAI/releases), various fixes, including {{< pr "341" >}}.
- 21-05-2023: __v1.14.0__ released. Minor updates to the `/models/apply` endpoint, `llama.cpp` backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. `gpt4all` is still compatible with the old format.
- 19-05-2023: __v1.13.0__ released! 🔥🔥 updates to the `gpt4all` and `llama` backends, consolidated CUDA support ( {{< pr "310" >}} thanks to @bubthegreat and @Thireus ), preliminary support for [installing models via API]({{%relref "docs/advanced#" %}}).
- 17-05-2023: __v1.12.0__ released! 🔥🔥 Minor fixes, plus CUDA ({{< pr "258" >}}) support for `llama.cpp`-compatible models and image generation ({{< pr "272" >}}).
- 16-05-2023: 🔥🔥🔥 Experimental support for CUDA ({{< pr "258" >}}) in the `llama.cpp` backend and Stable diffusion CPU image generation ({{< pr "272" >}}) in `master`.
Now LocalAI can generate images too:
| mode=0 | mode=1 (winograd/sgemm) |
|------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| ![b6441997879](https://github.com/go-skynet/LocalAI/assets/2420543/d50af51c-51b7-4f39-b6c2-bf04c403894c) | ![winograd2](https://github.com/go-skynet/LocalAI/assets/2420543/1935a69a-ecce-4afc-a099-1ac28cb649b3) |
- 14-05-2023: __v1.11.1__ released! `rwkv` backend patch release
- 13-05-2023: __v1.11.0__ released! 🔥 Updated `llama.cpp` bindings: This update includes a breaking change in the model files ( https://github.com/ggerganov/llama.cpp/pull/1405 ) - old models should still work with the `gpt4all-llama` backend.
- 12-05-2023: __v1.10.0__ released! 🔥🔥 Updated `gpt4all` bindings. Added support for GPTNeox (experimental), RedPajama (experimental), Starcoder (experimental), Replit (experimental), MosaicML MPT. Also now `embeddings` endpoint supports tokens arrays. See the [langchain-chroma](https://github.com/go-skynet/LocalAI/tree/master/examples/langchain-chroma) example! Note - this update does NOT include https://github.com/ggerganov/llama.cpp/pull/1405 which makes models incompatible.
- 11-05-2023: __v1.9.0__ released! 🔥 Important whisper updates ( {{< pr "233" >}} {{< pr "229" >}} ) and extended gpt4all model families support ( {{< pr "232" >}} ). Redpajama/dolly experimental ( {{< pr "214" >}} )
- 10-05-2023: __v1.8.0__ released! 🔥 Added support for fast and accurate embeddings with `bert.cpp` ( {{< pr "222" >}} )
- 09-05-2023: Added experimental support for transcriptions endpoint ( {{< pr "211" >}} )
- 08-05-2023: Support for embeddings with models using the `llama.cpp` backend ( {{< pr "207" >}} )
- 02-05-2023: Support for `rwkv.cpp` models ( {{< pr "158" >}} ) and for `/edits` endpoint
- 01-05-2023: Support for SSE stream of tokens in `llama.cpp` backends ( {{< pr "152" >}} )