---
title: Details matter with open source models
excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.
highlight_image: /assets/quantization.jpg
draft: false
nav_exclude: true
---

{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}

# Details matter with open source models
{: .no_toc }

Open source models like Qwen 2.5 32B Instruct are performing very well on
aider's code editing benchmark, rivaling closed source frontier models.
But pay attention to how your model is being served and quantized,
as it can impact code editing skill.
Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.
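
As a minimal sketch of what "how the model is being served" means in practice, the same base model can be reached through different serving paths, each with its own quantization and token limits. The provider prefixes below follow aider's litellm-style naming; the exact model slugs are illustrative, so check each provider's listing:

```bash
# One base model, two serving paths -- each route brings its own
# quantization, context window, and output token limits.
# Model slugs are illustrative; check each provider's listing.
aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M
```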
The graph and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.

- The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
- [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
- Results from individual providers served via OpenRouter and directly to their own APIs.
- Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M), as sketched below.
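
For example, pulling two quantizations of the same model for a side-by-side comparison looks like this (the q4_K_M tag is the one linked above; q8_0 is illustrative, so check the library page for the exact tags on offer):

```bash
# Fetch two quantizations of the same model for comparison.
# The q4_K_M tag is taken from the library link above; q8_0 is
# illustrative -- check the Ollama library page for the exact tags.
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:32b-instruct-q8_0
```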
The best versions of the model rival GPT-4o, while the worst performer
is more like the older GPT-4 Turbo.
Suboptimal choices in quantization and token limits can
easily produce far worse results.

This benchmarking effort highlighted a number of pitfalls and details which
can have a significant impact on the model's ability to correctly edit code:

- Quantization -- Open source models are often available at dozens of different quantizations.
- Context window -- Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama defaults to a tiny 2k context window,
and silently discards data that exceeds it (see the sketch after this list).
- Output token limits -- Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response (also covered in the sketch below).
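
Here is a minimal sketch of working around the last two pitfalls with Ollama, using its standard Modelfile parameters `num_ctx` (context window) and `num_predict` (output token limit). The base tag and the specific values are illustrative, not tuned recommendations:

```bash
# Build a local variant with a larger context window and output limit.
# num_ctx and num_predict are standard Ollama Modelfile parameters;
# the base tag and the values below are illustrative.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:32b-instruct-q4_K_M
PARAMETER num_ctx 16384
PARAMETER num_predict 8192
EOF
ollama create qwen2.5-coder-16k -f Modelfile

# Then point your tooling at the new variant, e.g.:
# aider --model ollama/qwen2.5-coder-16k
```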
### Sections
{: .no_toc }