Merge branch 'main' into watch

Paul Gauthier 2024-11-27 15:55:01 -08:00
commit f7a05cf077

@@ -12,6 +12,8 @@ nav_exclude: true
# Details matter with open source models
{: .no_toc }
<canvas id="quantChart" width="800" height="600" style="margin: 20px 0"></canvas>
Open source models like Qwen 2.5 32B Instruct are performing very well on
aider's code editing benchmark, rivaling closed source frontier models.
@@ -21,44 +23,56 @@ Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.
The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.
- The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
- [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
- The results from [OpenRouter's mix of providers](https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers) which serve the model with different levels of quantization.
- Results from OpenRouter's providers, both served via OpenRouter and directly to their own APIs.
- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M) with 8k+
context windows.
- An Ollama fp16 quantization served with Ollama's default 2k context window.
### Pitfalls and details

This benchmarking effort highlighted a number of pitfalls and details specific to open source
models which can have a significant impact on their ability to correctly edit code:
- **Quantization** -- Open source models are often available at dozens of different quantizations.
Most seem to only modestly decrease code editing skill, but stronger quantizations
do have a real impact.
- **Context window** -- Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama's local API server
defaults to a tiny 2k context window,
and silently discards data that exceeds it. Such a small window has
catastrophic effects on performance, without throwing obvious hard errors
(the sketch after this list shows how to raise the limit).
- **Output token limits** -- Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response.
- **Buggy cloud providers** -- While benchmarking Qwen 2.5 Coder 32B Instruct
and DeepSeek V2.5, I discovered
multiple cloud providers with broken or buggy API endpoints.
They seemed to be returning results that differed from what would be expected
given the advertised quantization and context sizes.
The harm caused to the code editing benchmark varied from serious
to catastrophic.
One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.
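
As a concrete illustration of the context window and output token pitfalls, here is a
minimal sketch of how you might raise both limits when talking to a local Ollama server.
It assumes Ollama is running on its default port with the q4_K_M model tag from the
Ollama library already pulled; `num_ctx` and `num_predict` are Ollama's documented
request options for the context window and the output token limit.

```python
# Minimal sketch: request a larger context window and output limit from a
# local Ollama server (assumes the default port and an already-pulled model).
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "stream": False,
    "options": {
        "num_ctx": 8192,      # lift the tiny 2k default context window
        "num_predict": 2048,  # allow a longer response
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["message"]["content"])
```

Without the explicit `num_ctx`, Ollama quietly truncates any input beyond 2k tokens,
which is exactly the silent failure mode described above.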
Closed source, proprietary models don't typically have these issues.
They are owned and operated by the organization that created them,
and served with specific, predictable context window and output token limits.
Their quantization level is usually unknown, but fixed and unchanging for all users.
### Conclusions
The best versions of the Qwen model rival GPT-4o, while the worst-performing
quantization is more like the older GPT-4 Turbo when served competently.
Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
if run with Ollama's default 2k context window.
### Sections
{: .no_toc }
@@ -67,7 +81,6 @@ if run with Ollama's default 2k context window.
## Benchmark results
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
{% include quant-chart.js %}