From 68be6c57426ffbe7f68fc4a8baf1b470425d944c Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Mon, 25 Nov 2024 19:11:18 -0800
Subject: [PATCH] copy

---
 .../website/_posts/2024-11-21-quantization.md | 33 ++++++++++++++-----
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/aider/website/_posts/2024-11-21-quantization.md b/aider/website/_posts/2024-11-21-quantization.md
index d3712eaca..efba3066b 100644
--- a/aider/website/_posts/2024-11-21-quantization.md
+++ b/aider/website/_posts/2024-11-21-quantization.md
@@ -1,6 +1,6 @@
 ---
-title: Quantization matters
-excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) is quantizing the model. It can affect code editing skill.
+title: Details matter with open source models
+excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.
 highlight_image: /assets/quantization.jpg
 draft: false
 nav_exclude: true
@@ -9,18 +9,20 @@ nav_exclude: true
 
 {{ page.date | date: "%B %d, %Y" }}
 
 {% endif %}
-# Quantization matters
+# Details matter with open source models
 {: .no_toc }
 Open source models like Qwen 2.5 32B Instruct are performing very well on
 aider's code editing benchmark, rivaling closed source frontier models.
-But pay attention to how your model is being quantized, as it
-can impact code editing skill.
-Heavily quantized models are often used by cloud API providers
-and local model servers like Ollama or MLX.
+
+But pay attention to how your model is being served and quantized,
+as it can impact code editing skill.
+Open source models are often available at a variety of quantizations,
+and can be served with different token limits.
+These details matter when working with code.
 
 The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
-served both locally and from cloud providers.
+served both locally and from a variety of cloud providers.
 
 - The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
 - [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
@@ -28,8 +30,21 @@ served both locally and from cloud providers.
 - Results from individual providers served via OpenRouter and directly to their own APIs.
 - Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
 
-The best version of the model rivals GPT-4o, while the worst performer
+The best versions of the model rival GPT-4o, while the worst performer
 is more like the older GPT-4 Turbo.
+Suboptimal choices in quantization and token limits can
+easily produce far worse results.
+
+This benchmarking effort highlighted a number of pitfalls and details which
+can have a significant impact on the model's ability to correctly edit code:
+
+- Quantization -- Open source models are often available at dozens of different quantizations.
+- Context window -- Cloud providers can decide how large a context window to accept,
+and they often choose differently. Ollama defaults to a tiny 2k context window,
+and silently discards data that exceeds it.
+- Output token limits -- Open source models are often served with wildly
+differing output token limits. This has a direct impact on how much code the
+model can write or edit in a response.
 
 ### Sections
 {: .no_toc }
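To make the context window and output limit pitfalls in the patch concrete, here is a minimal sketch of a request against Ollama's REST API. It assumes a local Ollama server on the default port that has already pulled the `qwen2.5-coder:32b-instruct-q4_K_M` quantization from the model library linked above; the `num_ctx` and `num_predict` values are illustrative choices, not recommendations.

```python
import requests

# Minimal sketch: query a specific Qwen 2.5 Coder quantization via Ollama,
# explicitly overriding the defaults called out in the post. Assumes
# `ollama pull qwen2.5-coder:32b-instruct-q4_K_M` has been run and the
# server is listening on the default port.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:32b-instruct-q4_K_M",  # one specific quant tag
        "messages": [
            {"role": "user", "content": "Write a python function that ..."}
        ],
        "options": {
            "num_ctx": 8192,      # context window; Ollama defaults to 2k
            "num_predict": 2048,  # output token limit for the response
        },
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

The same principle applies to any client or provider: unless the context window and output token limits are raised explicitly, the defaults silently cap how much code the model can see and how much it can write back.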