From d54fbd6592c224c62581f2efc3516b684c123420 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Wed, 27 Nov 2024 15:23:13 -0800
Subject: [PATCH] copy

---
 .../website/_posts/2024-11-21-quantization.md | 45 ++++++++++++-------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/aider/website/_posts/2024-11-21-quantization.md b/aider/website/_posts/2024-11-21-quantization.md
index a1d060edd..f2426b9c2 100644
--- a/aider/website/_posts/2024-11-21-quantization.md
+++ b/aider/website/_posts/2024-11-21-quantization.md
@@ -12,6 +12,8 @@ nav_exclude: true
 # Details matter with open source models
 {: .no_toc }
 
+
+
 Open source models like Qwen 2.5 32B Instruct are performing
 very well on aider's code editing benchmark, rivaling closed source
 frontier models.
@@ -21,44 +23,56 @@ Open source models are often available at a variety of quantizations,
 and can be served with different token limits.
 These details matter when working with code.
 
-The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
+The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
 served both locally and from a variety of cloud providers.
 
 - The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
 - [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
 - The results from [OpenRouter's mix of providers](https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers) which serve the model with different levels of quantization.
-- Results from individual providers served via OpenRouter and directly to their own APIs.
-- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
+- Results from OpenRouter's individual providers, accessed both via OpenRouter and directly through their own APIs.
+- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M) with 8k+
+context windows.
+- An Ollama fp16 quantization served with Ollama's default 2k context window.
 
-This benchmarking effort highlighted a number of pitfalls and details which
-can have a significant impact on the model's ability to correctly edit code:
+### Pitfalls and details
+
+This benchmarking effort highlighted a number of pitfalls and details,
+specific to open source models,
+which can have a significant impact on their ability to correctly edit code:
 
 - **Quantization** -- Open source models are often available at dozens of
 different quantizations.
 Most seem to only modestly decrease code editing skill, but stronger quantizations
 do have a real impact.
 - **Context window** -- Cloud providers can decide how large a context
 window to accept,
-and they often choose differently. Ollama defaults to a tiny 2k context window,
+and they often choose differently. Ollama's local API server
+defaults to a tiny 2k context window,
 and silently discards data that exceeds it. Such a small window has
-catastrophic effects on performance.
+catastrophic effects on performance, without raising any obvious errors.
+(See the sketch after this list for one way to raise the limit.)
 - **Output token limits** -- Open source models are often served with wildly
 differing output token limits. This has a direct impact on how much code the
 model can write or edit in a response.
-- **Buggy cloud providers** -- Between Qwen 2.5 Coder 32B Instruct
-and DeepSeek V2.5, there were
+- **Buggy cloud providers** -- While benchmarking Qwen 2.5 Coder 32B Instruct
+and DeepSeek V2.5, I discovered
 multiple cloud providers with broken or buggy API endpoints. They seemed
-to be returning result different from expected based on the advertised
+to be returning results different from those expected based on the advertised
 quantization and context sizes. The harm caused to the code editing
 benchmark varied from serious to catastrophic.
+One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.
 
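+As a concrete illustration of the context window pitfall, here is a
+minimal sketch (not part of the benchmark harness) of one way to raise
+Ollama's 2k default, using the `num_ctx` option of Ollama's `/api/generate`
+endpoint. The model tag, window size and prompt are just examples:
+
+```python
+import requests
+
+resp = requests.post(
+    "http://localhost:11434/api/generate",
+    json={
+        "model": "qwen2.5-coder:32b-instruct-q4_K_M",
+        "prompt": "Write a Python function that reverses a string.",
+        # Without this option, Ollama defaults to a 2k token context
+        # window and silently truncates anything beyond it.
+        "options": {"num_ctx": 8192},
+        "stream": False,
+    },
+    timeout=600,
+)
+print(resp.json()["response"])
+```
+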
-The best versions of the model rival GPT-4o, while the worst performing
-quantization is more like the older GPT-4 Turbo.
-Even an excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
+Closed source, proprietary models don't typically have these issues.
+They are owned and operated by the organization that created them,
+and are served with specific, predictable context window and output token limits.
+Their quantization level is usually unknown, but it is fixed and unchanging for all users.
+
+### Conclusions
+
+The best versions of the Qwen model rival GPT-4o, while the worst performing
+quantization, even when served competently, is more like the older GPT-4 Turbo.
+Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
 if run with Ollama's default 2k context window.
 
-
-
 ### Sections
 {: .no_toc }
@@ -67,7 +81,6 @@ if run with Ollama's default 2k context window.
 
 ## Benchmark results
 
-
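+The per-provider results can also be spot-checked against OpenRouter
+directly. This is a minimal sketch, not the benchmark harness itself:
+a request is pinned to a single provider using OpenRouter's provider
+routing options, and "SomeProvider" is a placeholder, not a real
+provider name:
+
+```python
+import os
+
+import requests
+
+resp = requests.post(
+    "https://openrouter.ai/api/v1/chat/completions",
+    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
+    json={
+        "model": "qwen/qwen-2.5-coder-32b-instruct",
+        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
+        # Route to exactly one provider, so its quantization and context
+        # window handling can be tested in isolation.
+        "provider": {"order": ["SomeProvider"], "allow_fallbacks": False},
+    },
+    timeout=600,
+)
+print(resp.json()["choices"][0]["message"]["content"])
+```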