Mirror of https://github.com/Aider-AI/aider.git, synced 2025-05-28 00:05:01 +00:00

copy

This commit is contained in:
parent 7a34a2dfa9
commit 68be6c5742

1 changed file with 24 additions and 9 deletions
@@ -1,6 +1,6 @@
 ---
-title: Quantization matters
-excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) is quantizing the model. It can affect code editing skill.
+title: Details matter with open source models
+excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.
 highlight_image: /assets/quantization.jpg
 draft: false
 nav_exclude: true
@@ -9,18 +9,20 @@ nav_exclude: true
 <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
 {% endif %}
 
-# Quantization matters
+# Details matter with open source models
 {: .no_toc }
 
 Open source models like Qwen 2.5 32B Instruct are performing very well on
 aider's code editing benchmark, rivaling closed source frontier models.
-But pay attention to how your model is being quantized, as it
-can impact code editing skill.
-Heavily quantized models are often used by cloud API providers
-and local model servers like Ollama or MLX.
+But pay attention to how your model is being served and quantized,
+as it can impact code editing skill.
+Open source models are often available at a variety of quantizations,
+and can be served with different token limits.
+These details matter when working with code.
 
 The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model,
-served both locally and from cloud providers.
+served both locally and from a variety of cloud providers.
 
 - The [HuggingFace BF16 weights](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) served via [glhf.chat](https://glhf.chat).
 - [4bit and 8bit quants for mlx](https://t.co/cwX3DYX35D).
@@ -28,8 +30,21 @@ served both locally and from cloud providers.
 - Results from individual providers served via OpenRouter and directly to their own APIs.
 - Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
 
-The best version of the model rivals GPT-4o, while the worst performer
+The best versions of the model rival GPT-4o, while the worst performer
 is more like the older GPT-4 Turbo.
+Suboptimal choices in quantization and token limits can
+easily produce far worse results.
+
+This benchmarking effort highlighted a number of pitfalls and details which
+can have a significant impact on the model's ability to correctly edit code:
+
+- Quantization -- Open source models are often available at dozens of different quantizations.
+- Context window -- Cloud providers can decide how large a context window to accept,
+and they often choose differently. Ollama defaults to a tiny 2k context window,
+and silently discards data that exceeds it.
+- Output token limits -- Open source models are often served with wildly
+differing output token limits. This has a direct impact on how much code the
+model can write or edit in a response.
 
 ### Sections
 {: .no_toc }
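The context window and output token pitfalls listed in the added text are both adjustable when serving the model locally. A minimal sketch using Ollama's documented `num_ctx` and `num_predict` options against its local REST API; the model tag matches the Ollama library link in the diff, while the prompt and the specific limit values are illustrative assumptions:

```python
# Minimal sketch: querying a specific Ollama quantization with an explicit
# context window and output token limit. Assumes Ollama is running locally
# and the model tag below has been pulled, e.g.:
#   ollama pull qwen2.5-coder:32b-instruct-q4_K_M
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",  # a specific quant, not just ":32b"
    "prompt": "Write a Python function that reverses a string.",  # illustrative prompt
    "stream": False,
    "options": {
        "num_ctx": 16384,     # raise Ollama's small default context window
        "num_predict": 4096,  # allow enough output tokens for larger edits
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

Raising `num_ctx` avoids Ollama silently discarding a large prompt, and `num_predict` bounds how much code the model can write or edit in a single reply.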