From d54fbd6592c224c62581f2efc3516b684c123420 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Wed, 27 Nov 2024 15:23:13 -0800
Subject: [PATCH] copy

---
 .../website/_posts/2024-11-21-quantization.md | 45 ++++++++++++-------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/aider/website/_posts/2024-11-21-quantization.md b/aider/website/_posts/2024-11-21-quantization.md
index a1d060edd..f2426b9c2 100644
--- a/aider/website/_posts/2024-11-21-quantization.md
+++ b/aider/website/_posts/2024-11-21-quantization.md
@@ -12,6 +12,8 @@ nav_exclude: true
 # Details matter with open source models
 {: .no_toc }
 
+
+
 Open source models like Qwen 2.5 32B Instruct are performing
 very well on aider's code editing benchmark, rivaling closed source
 frontier models.
@@ -21,44 +23,56 @@ Open source models are often available at a variety of quantizations,
 and can be served with different token limits.
 These details matter when working with code.
 
-The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
+The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
 served both locally and from a variety of cloud providers.
 
 - The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
 - [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
 - The results from [OpenRouter's mix of providers](https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers) which serve the model with different levels of quantization.
-- Results from individual providers served via OpenRouter and directly to their own APIs.
-- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
+- Results from OpenRouter's individual providers, accessed both via OpenRouter and directly through their own APIs.
+- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M) with 8k+
+context windows.
+- An Ollama fp16 quantization served with Ollama's default 2k context window.
 
-This benchmarking effort highlighted a number of pitfalls and details which
-can have a significant impact on the model's ability to correctly edit code:
+### Pitfalls and details
+
+This benchmarking effort highlighted a number of pitfalls and details,
+specific to open source models,
+which can have a significant impact on their ability to correctly edit code:
 
 - **Quantization** -- Open source models are often available at dozens of
 different quantizations.
 Most seem to only modestly decrease code editing skill, but stronger quantizations
 do have a real impact.
 - **Context window** -- Cloud providers can decide how large a context
 window to accept,
-and they often choose differently. Ollama defaults to a tiny 2k context window,
+and they often choose differently. Ollama's local API server
+defaults to a tiny 2k context window,
 and silently discards data that exceeds it. Such a small window has
-catastrophic effects on performance.
+catastrophic effects on performance, without raising any obvious errors.
+(See the sketch after this list for one way to raise the limit.)
 - **Output token limits** -- Open source models are often served with wildly
 differing output token limits. This has a direct impact on how much code the
 model can write or edit in a response.
-- **Buggy cloud providers** -- Between Qwen 2.5 Coder 32B Instruct
-and DeepSeek V2.5, there were
+- **Buggy cloud providers** -- While benchmarking Qwen 2.5 Coder 32B Instruct
+and DeepSeek V2.5, I discovered
 multiple cloud providers with broken or buggy API endpoints. They seemed
-to be returning result different from expected based on the advertised
+to be returning results different from those expected based on the advertised
 quantization and context sizes. The harm caused to the code editing
 benchmark varied from serious to catastrophic.
+One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.
 
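+As a concrete illustration of the context window pitfall, here is a
+minimal sketch (not part of the benchmark harness) of one way to raise
+Ollama's 2k default, using the `num_ctx` option of Ollama's `/api/generate`
+endpoint. The model tag, window size and prompt are just examples:
+
+```python
+import requests
+
+resp = requests.post(
+    "http://localhost:11434/api/generate",
+    json={
+        "model": "qwen2.5-coder:32b-instruct-q4_K_M",
+        "prompt": "Write a Python function that reverses a string.",
+        # Without this option, Ollama defaults to a 2k token context
+        # window and silently truncates anything beyond it.
+        "options": {"num_ctx": 8192},
+        "stream": False,
+    },
+    timeout=600,
+)
+print(resp.json()["response"])
+```
+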
-The best versions of the model rival GPT-4o, while the worst performing
-quantization is more like the older GPT-4 Turbo.
-Even an excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
+Closed source, proprietary models don't typically have these issues.
+They are owned and operated by the organization that created them,
+and are served with specific, predictable context window and output token limits.
+Their quantization level is usually unknown, but it is fixed and unchanging for all users.
+
+### Conclusions
+
+The best versions of the Qwen model rival GPT-4o, while the worst performing
+quantization, even when served competently, is more like the older GPT-4 Turbo.
+Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
 if run with Ollama's default 2k context window.
 
-
-
 ### Sections
 {: .no_toc }
@@ -67,7 +81,6 @@ if run with Ollama's default 2k context window.
 
 ## Benchmark results
 
-
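+The per-provider results can also be spot-checked against OpenRouter
+directly. This is a minimal sketch, not the benchmark harness itself:
+a request is pinned to a single provider using OpenRouter's provider
+routing options, and "SomeProvider" is a placeholder, not a real
+provider name:
+
+```python
+import os
+
+import requests
+
+resp = requests.post(
+    "https://openrouter.ai/api/v1/chat/completions",
+    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
+    json={
+        "model": "qwen/qwen-2.5-coder-32b-instruct",
+        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
+        # Route to exactly one provider, so its quantization and context
+        # window handling can be tested in isolation.
+        "provider": {"order": ["SomeProvider"], "allow_fallbacks": False},
+    },
+    timeout=600,
+)
+print(resp.json()["choices"][0]["message"]["content"])
+```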