Bump vLLM version + more options when loading models in vLLM (#1782)

* Bump vLLM version to 0.3.2

* Add vLLM model loading options

* Remove transformers-exllama

* Fix install exllama
Ludovic Leroux 2024-03-01 16:48:53 -05:00 committed by GitHub
parent 1c312685aa
commit 939411300a
GPG key ID: B5690EEEBB952194
28 changed files with 736 additions and 641 deletions

@@ -245,8 +245,18 @@ backend: vllm
 parameters:
     model: "facebook/opt-125m"
-# Decomment to specify a quantization method (optional)
+# Uncomment to specify a quantization method (optional)
 # quantization: "awq"
+# Uncomment to limit the GPU memory utilization (vLLM default is 0.9 for 90%)
+# gpu_memory_utilization: 0.5
+# Uncomment to trust remote code from huggingface
+# trust_remote_code: true
+# Uncomment to enable eager execution
+# enforce_eager: true
+# Uncomment to specify the size of the CPU swap space per GPU (in GiB)
+# swap_space: 2
+# Uncomment to specify the maximum length of a sequence (including prompt and output)
+# max_model_len: 32768
 ```
 The backend will automatically download the required files in order to run the model.
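
The option names added in this hunk match keyword arguments of vLLM's engine configuration (`quantization`, `gpu_memory_utilization`, `trust_remote_code`, `enforce_eager`, `swap_space`, `max_model_len`). As a rough sketch of how a backend might forward only the options a user has uncommented, the helper below builds a keyword dictionary from a parsed `parameters` mapping. The function name `vllm_engine_kwargs` and the forwarding logic are hypothetical and for illustration only; this is not the code from the commit.

```python
# Hypothetical helper: not taken from the commit, illustration only.
# Option names mirror the YAML keys documented in the hunk above.
OPTIONAL_KEYS = (
    "quantization",
    "gpu_memory_utilization",
    "trust_remote_code",
    "enforce_eager",
    "swap_space",
    "max_model_len",
)


def vllm_engine_kwargs(params: dict) -> dict:
    """Return the model name plus only those optional settings that are set."""
    kwargs = {"model": params["model"]}  # the model name is required
    for key in OPTIONAL_KEYS:
        value = params.get(key)
        if value is not None:  # commented-out YAML keys are simply absent
            kwargs[key] = value
    return kwargs


if __name__ == "__main__":
    # Equivalent to uncommenting two of the options in the YAML example.
    config = {
        "model": "facebook/opt-125m",
        "gpu_memory_utilization": 0.5,
        "enforce_eager": True,
    }
    print(vllm_engine_kwargs(config))
```

Leaving unset options out of the dictionary lets vLLM apply its own defaults (for example 0.9 for `gpu_memory_utilization`, as noted in the comment above) rather than having the backend hard-code them.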