mirror of https://github.com/mudler/LocalAI.git
Bump vLLM version + more options when loading models in vLLM (#1782)
* Bump vLLM version to 0.3.2
* Add vLLM model loading options
* Remove transformers-exllama
* Fix install exllama
parent 1c312685aa
commit 939411300a
28 changed files with 736 additions and 641 deletions
````diff
@@ -245,8 +245,18 @@ backend: vllm
 parameters:
   model: "facebook/opt-125m"
 
-# Decomment to specify a quantization method (optional)
+# Uncomment to specify a quantization method (optional)
 # quantization: "awq"
+# Uncomment to limit the GPU memory utilization (vLLM default is 0.9 for 90%)
+# gpu_memory_utilization: 0.5
+# Uncomment to trust remote code from huggingface
+# trust_remote_code: true
+# Uncomment to enable eager execution
+# enforce_eager: true
+# Uncomment to specify the size of the CPU swap space per GPU (in GiB)
+# swap_space: 2
+# Uncomment to specify the maximum length of a sequence (including prompt and output)
+# max_model_len: 32768
 ```
 
 The backend will automatically download the required files in order to run the model.
````
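For reference, here is a minimal sketch of how the YAML options above correspond to vLLM's own engine arguments. This is not LocalAI's backend implementation; it only illustrates what each option controls, assuming vLLM 0.3.2 or later, and reuses the example values from the config above.

```python
# Minimal sketch, NOT LocalAI's backend code: it only shows how the YAML options
# above map onto vLLM's engine arguments (assumes vLLM >= 0.3.2 is installed).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    # quantization="awq",          # only for checkpoints quantized with AWQ
    gpu_memory_utilization=0.5,    # fraction of GPU memory vLLM may reserve (default 0.9)
    trust_remote_code=True,        # allow custom model code from Hugging Face
    enforce_eager=True,            # skip CUDA graph capture, run eagerly
    swap_space=2,                  # CPU swap space per GPU, in GiB
    # max_model_len=32768,         # max sequence length (prompt + output); must not
                                   # exceed the model's context window (2048 for opt-125m)
)

# Quick smoke test: generate a short completion with the loaded model.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```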