diff --git a/aider/website/_posts/2024-11-21-quantization.md b/aider/website/_posts/2024-11-21-quantization.md
index a1d060edd..f2426b9c2 100644
--- a/aider/website/_posts/2024-11-21-quantization.md
+++ b/aider/website/_posts/2024-11-21-quantization.md
@@ -12,6 +12,8 @@ nav_exclude: true
# Details matter with open source models
{: .no_toc }
+
+
Open source models like Qwen 2.5 32B Instruct are performing very well on
aider's code editing benchmark, rivaling closed source frontier models.
@@ -21,44 +23,56 @@ Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.
-The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
+The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.
- The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
- [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
- The results from [OpenRouter's mix of providers](https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers) which serve the model with different levels of quantization.
-- Results from individual providers served via OpenRouter and directly to their own APIs.
-- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
+- Results from OpenRouter's individual providers, accessed both via OpenRouter and directly through their own APIs (see the request sketch after this list).
+- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M) with 8k+
+context windows.
+- An Ollama fp16 quantization served with Ollama's default 2k context window.
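+
+Most of these cloud endpoints speak the OpenAI-compatible chat completions
+protocol, so they can be exercised with any OpenAI-style client.
+Below is a minimal sketch of such a request against OpenRouter; the API key
+placeholder and prompt are illustrative, and the model slug comes from the
+OpenRouter page linked above:
+
+```python
+from openai import OpenAI
+
+# OpenRouter serves an OpenAI-compatible API, so the standard client works
+# once it is pointed at the OpenRouter base URL.
+client = OpenAI(
+    base_url="https://openrouter.ai/api/v1",
+    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder
+)
+
+response = client.chat.completions.create(
+    model="qwen/qwen-2.5-coder-32b-instruct",
+    messages=[{"role": "user", "content": "Write a hello world function."}],
+    max_tokens=1024,  # output token limits vary by provider
+)
+print(response.choices[0].message.content)
+```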
-This benchmarking effort highlighted a number of pitfalls and details which
-can have a significant impact on the model's ability to correctly edit code:
+### Pitfalls and details
+
+This benchmarking effort highlighted a number of pitfalls and details
+specific to open source models
+that can have a significant impact on their ability to correctly edit code:
- **Quantization** -- Open source models are often available at dozens of different quantizations.
Most seem to only modestly decrease code editing skill, but stronger quantizations
do have a real impact.
- **Context window** -- Cloud providers can decide how large a context window to accept,
-and they often choose differently. Ollama defaults to a tiny 2k context window,
+and they often choose differently. Ollama's local API server
+defaults to a tiny 2k context window,
and silently discards data that exceeds it. Such a small window has
-catastrophic effects on performance.
+catastrophic effects on performance without surfacing any obvious errors (see the sketch after this list for how to raise the limit).
- **Output token limits** -- Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response.
-- **Buggy cloud providers** -- Between Qwen 2.5 Coder 32B Instruct
-and DeepSeek V2.5, there were
+- **Buggy cloud providers** -- While benchmarking Qwen 2.5 Coder 32B Instruct
+and DeepSeek V2.5, I discovered
multiple cloud providers with broken or buggy API endpoints.
They seemed
-to be returning result different from expected based on the advertised
+to be returning results inconsistent with the advertised
quantization and context sizes.
The harm caused to the code editing benchmark varied from serious
to catastrophic.
+One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.
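+
+As a concrete example of the context window and output token pitfalls,
+Ollama's REST API accepts per-request options that override its defaults.
+A minimal sketch, assuming a local Ollama server with one of the quants
+above already pulled; the prompt and the exact limits are illustrative:
+
+```python
+import requests
+
+# Ask a local Ollama server for a completion, explicitly raising the
+# context window (num_ctx) above the tiny 2k default and capping the
+# output length (num_predict).
+response = requests.post(
+    "http://localhost:11434/api/chat",
+    json={
+        "model": "qwen2.5-coder:32b-instruct-q4_K_M",
+        "messages": [{"role": "user", "content": "Refactor this function..."}],
+        "options": {
+            "num_ctx": 8192,      # without this, Ollama silently truncates at 2k
+            "num_predict": 2048,  # cap on output tokens for this response
+        },
+        "stream": False,
+    },
+    timeout=600,
+)
+print(response.json()["message"]["content"])
+```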
-The best versions of the model rival GPT-4o, while the worst performing
-quantization is more like the older GPT-4 Turbo.
-Even an excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
+Closed source, proprietary models don't typically have these issues.
+They are owned and operated by the organization that created them,
+and typically served with specific, predictable context window and output token limits.
+Their quantization level is usually unknown, but it is fixed and consistent for all users.
+
+### Conclusions
+
+The best versions of the Qwen model rival GPT-4o, while the worst performing
+quantization is closer to the older GPT-4 Turbo, even when served competently.
+Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
if run with Ollama's default 2k context window.
-
-
### Sections
{: .no_toc }
@@ -67,7 +81,6 @@ if run with Ollama's default 2k context window.
## Benchmark results
-