From 68be6c57426ffbe7f68fc4a8baf1b470425d944c Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Mon, 25 Nov 2024 19:11:18 -0800
Subject: [PATCH] copy

---
 .../website/_posts/2024-11-21-quantization.md | 33 ++++++++++++++-----
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/aider/website/_posts/2024-11-21-quantization.md b/aider/website/_posts/2024-11-21-quantization.md
index d3712eaca..efba3066b 100644
--- a/aider/website/_posts/2024-11-21-quantization.md
+++ b/aider/website/_posts/2024-11-21-quantization.md
@@ -1,6 +1,6 @@
 ---
-title: Quantization matters
-excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) is quantizing the model. It can affect code editing skill.
+title: Details matter with open source models
+excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.
 highlight_image: /assets/quantization.jpg
 draft: false
 nav_exclude: true
@@ -9,18 +9,20 @@ nav_exclude: true
 
 {{ page.date | date: "%B %d, %Y" }}
 
 {% endif %}
-# Quantization matters
+# Details matter with open source models
 {: .no_toc }
 Open source models like Qwen 2.5 32B Instruct are performing very well on
 aider's code editing benchmark, rivaling closed source frontier models.
-But pay attention to how your model is being quantized, as it
-can impact code editing skill.
-Heavily quantized models are often used by cloud API providers
-and local model servers like Ollama or MLX.
+
+But pay attention to how your model is being served and quantized,
+as it can impact code editing skill.
+Open source models are often available at a variety of quantizations,
+and can be served with different token limits.
+These details matter when working with code.
 
 The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
-served both locally and from cloud providers.
+served both locally and from a variety of cloud providers.
 
 - The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
 - [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
@@ -28,8 +30,21 @@ served both locally and from cloud providers.
 - Results from individual providers served via OpenRouter and directly to their own APIs.
 - Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
 
-The best version of the model rivals GPT-4o, while the worst performer
+The best versions of the model rival GPT-4o, while the worst performer
 is more like the older GPT-4 Turbo.
+Suboptimal choices in quantization and token limits can
+easily produce far worse results.
+
+This benchmarking effort highlighted a number of pitfalls and details which
+can have a significant impact on the model's ability to correctly edit code:
+
+- Quantization -- Open source models are often available at dozens of different quantizations.
+- Context window -- Cloud providers can decide how large a context window to accept,
+and they often choose differently. Ollama defaults to a tiny 2k context window,
+and silently discards data that exceeds it.
+- Output token limits -- Open source models are often served with wildly
+differing output token limits. This has a direct impact on how much code the
+model can write or edit in a response.
 
 ### Sections
 {: .no_toc }
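To make the context window and output limit pitfalls in the patch concrete, here is a minimal sketch of a request against Ollama's REST API. It assumes a local Ollama server on the default port that has already pulled the `qwen2.5-coder:32b-instruct-q4_K_M` quantization from the model library linked above; the `num_ctx` and `num_predict` values are illustrative choices, not recommendations.

```python
import requests

# Minimal sketch: query a specific Qwen 2.5 Coder quantization via Ollama,
# explicitly overriding the defaults called out in the post. Assumes
# `ollama pull qwen2.5-coder:32b-instruct-q4_K_M` has been run and the
# server is listening on the default port.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:32b-instruct-q4_K_M",  # one specific quant tag
        "messages": [
            {"role": "user", "content": "Write a python function that ..."}
        ],
        "options": {
            "num_ctx": 8192,      # context window; Ollama defaults to 2k
            "num_predict": 2048,  # output token limit for the response
        },
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

The same principle applies to any client or provider: unless the context window and output token limits are raised explicitly, the defaults silently cap how much code the model can see and how much it can write back.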