# Details matter with open source models
{: .no_toc }

<canvas id="quantChart" width="800" height="600" style="margin: 20px 0"></canvas>

Open source models like Qwen 2.5 Coder 32B Instruct are performing very well on
aider's code editing benchmark, rivaling closed source frontier models.

Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.

The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.

- The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
- [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
- The results from [OpenRouter's mix of providers](https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers) which serve the model with different levels of quantization.
- Results from OpenRouter's providers, both served via OpenRouter and directly to their own APIs.
- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M) with 8k+
context windows.
- An Ollama fp16 quantization served with Ollama's default 2k context window.
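
All of these serving setups can be queried through OpenAI-compatible chat endpoints,
which is what makes head-to-head comparisons like this practical.
Below is a minimal sketch of the idea; the base URLs, model names and placeholder
API key are illustrative assumptions, so check each provider's documentation for
the exact values.

```python
# Minimal sketch: ask the "same" model the same question across different
# serving setups, via their OpenAI-compatible endpoints. The base URLs,
# model names and keys below are illustrative assumptions.
from openai import OpenAI

ENDPOINTS = [
    # (label, base_url, model, api_key)
    ("openrouter", "https://openrouter.ai/api/v1",
     "qwen/qwen-2.5-coder-32b-instruct", "YOUR_OPENROUTER_KEY"),
    ("ollama q4_K_M", "http://localhost:11434/v1",
     "qwen2.5-coder:32b-instruct-q4_K_M", "ollama"),  # Ollama ignores the key
]

PROMPT = "Write a Python function that reverses each word in a string."

for label, base_url, model, key in ENDPOINTS:
    client = OpenAI(base_url=base_url, api_key=key)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- {label} ---")
    print(resp.choices[0].message.content)
```

Differences in quantization, context handling and token limits then show up
directly in the quality of the replies.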

### Pitfalls and details

This benchmarking effort highlighted a number of pitfalls and details specific to open source
models which
can have a significant impact on their ability to correctly edit code:

- **Quantization** -- Open source models are often available at dozens of different quantizations.
Most seem to only modestly decrease code editing skill, but stronger quantizations
do have a real impact.
- **Context window** -- Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama's local API server
defaults to a tiny 2k context window,
and silently discards data that exceeds it. Such a small window has
catastrophic effects on performance, without throwing obvious hard errors.
The sketch after this list shows one way to raise this limit explicitly.
- **Output token limits** -- Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response.
- **Buggy cloud providers** -- While benchmarking Qwen 2.5 Coder 32B Instruct
and DeepSeek V2.5, I discovered
multiple cloud providers with broken or buggy API endpoints.
They seemed
to be returning results inconsistent with their advertised
quantization and context sizes.
The harm caused to the code editing benchmark varied from serious
to catastrophic.
One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.
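
Two of these pitfalls, the context window and the output token limit, can be
controlled per request when serving locally.
As a minimal sketch, assuming a local Ollama server with this quant already
pulled, Ollama's native chat API accepts `num_ctx` and `num_predict` options:

```python
# Minimal sketch: explicitly set Ollama's context window (num_ctx) and
# output token limit (num_predict) rather than relying on the 2k default.
# Assumes a local Ollama server with this model already pulled.
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Refactor this function ..."}],
    "stream": False,
    "options": {
        "num_ctx": 8192,      # avoid silent truncation of long prompts
        "num_predict": 2048,  # cap on output tokens per reply
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["message"]["content"])
```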

Closed source, proprietary models don't typically have these issues.
They are owned and operated by the organization that created them,
and typically served with specific, predictable context window and output token limits.
Their quantization level is usually unknown, but fixed and unchanging for all users.

### Conclusions

The best versions of the Qwen model rival GPT-4o, while the worst performing
quantization is more like the older GPT-4 Turbo when served competently.
Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
if run with Ollama's default 2k context window.
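
Because the truncation is silent, it is worth sanity-checking a local setup
before drawing conclusions from it.
One rough approach, sketched below assuming the same local Ollama server as
above, is to bury a marker near the start of an oversized prompt and check
whether the model can recall it:

```python
# Rough sketch: detect silent context truncation by hiding a marker near
# the start of an oversized prompt. If the server drops the start of the
# prompt, the model cannot echo the marker back.
import json
import urllib.request

MARKER = "XYZZY-1234"
padding = "filler " * 4000  # sized well past a 2k-token window (rough heuristic)

prompt = (
    f"The secret marker is {MARKER}.\n\n"
    + padding
    + "\n\nWhat was the secret marker stated at the very beginning?"
)

payload = {
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": prompt}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.load(resp)["message"]["content"]

print("context intact" if MARKER in answer else "marker lost: prompt likely truncated")
```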

### Sections
{: .no_toc }

## Benchmark results

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
{% include quant-chart.js %}
</script>