---
title: Details matter with open source models
excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.
highlight_image: /assets/quantization.jpg
draft: false
nav_exclude: true
---

{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}

# Details matter with open source models
{: .no_toc }

Open source models like Qwen 2.5 32B Instruct are performing very well on
aider's code editing benchmark, rivaling closed source frontier models.
But pay attention to how your model is being served and quantized,
as it can impact code editing skill.
Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.
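
As a minimal sketch of what "how the model is being served" means in practice, the same base model can be reached through different serving paths, each with its own quantization and token limits. The provider prefixes below follow aider's litellm-style naming; the exact model slugs are illustrative, so check each provider's listing:

```bash
# One base model, two serving paths -- each route brings its own
# quantization, context window, and output token limits.
# Model slugs are illustrative; check each provider's listing.
aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M
```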
The graph and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.

- The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
- [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
- Results from individual providers served via OpenRouter and directly to their own APIs.
- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M), as sketched below.
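
For example, pulling two quantizations of the same model for a side-by-side comparison looks like this (the q4_K_M tag is the one linked above; q8_0 is illustrative, so check the library page for the exact tags on offer):

```bash
# Fetch two quantizations of the same model for comparison.
# The q4_K_M tag is taken from the library link above; q8_0 is
# illustrative -- check the Ollama library page for the exact tags.
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q8_0
```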
The best versions of the model rival GPT-4o, while the worst performer
is more like the older GPT-4 Turbo.
Suboptimal choices in quantization and token limits can
easily produce far worse results.

This benchmarking effort highlighted a number of pitfalls and details which
can have a significant impact on the model's ability to correctly edit code:

- Quantization -- Open source models are often available at dozens of different quantizations.
- Context window -- Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama defaults to a tiny 2k context window,
and silently discards data that exceeds it (see the sketch after this list).
- Output token limits -- Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response (also covered in the sketch below).
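
Here is a minimal sketch of working around the last two pitfalls with Ollama, using its standard Modelfile parameters `num_ctx` (context window) and `num_predict` (output token limit). The base tag and the specific values are illustrative, not tuned recommendations:

```bash
# Build a local variant with a larger context window and output limit.
# num_ctx and num_predict are standard Ollama Modelfile parameters;
# the base tag and the values below are illustrative.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:32b-instruct-q4_K_M
PARAMETER num_ctx 16384
PARAMETER num_predict 8192
EOF
ollama create qwen2.5-coder-16k -f Modelfile

# Then point your tooling at the new variant, e.g.:
# aider --model ollama/qwen2.5-coder-16k
```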
### Sections
{: .no_toc }