---
title: Quantization matters
excerpt: Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) is quantizing the model. It can affect code editing skill.
highlight_image: /assets/quantization.jpg
draft: false
nav_exclude: true
---
{% if page.date %}
{{ page.date | date: "%B %d, %Y" }}
{% endif %}

# Quantization matters
{: .no_toc }
Open source models like Qwen 2.5 32B Instruct are performing very well on aider's code editing benchmark, rivaling closed source frontier models. But pay attention to how your model is being quantized, as it can impact code editing skill. Heavily quantized models are often used by cloud API providers and local model servers like Ollama or MLX.
The graph and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model, served both locally and from cloud providers:
- The HuggingFace BF16 weights, served via glhf.chat.
- The 4bit and 8bit quants for MLX.
- The results from OpenRouter's mix of providers, which serve the model with different levels of quantization.
- Ollama locally serving different quantizations from the Ollama model library.
- Other API providers.
The best version of the model rivals GPT-4o, while the worst performer is more like the older GPT-4 Turbo.
## Sections
{: .no_toc }
- TOC
{:toc}
## Benchmark results
{% assign quant_sorted = site.data.quant | sort: 'pass_rate_2' | reverse %}

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|-------|-----------------------------|-----------------------------------|---------|-------------|
{% for row in quant_sorted %}| {{ row.model }} | {{ row.pass_rate_2 }}% | {{ row.percent_cases_well_formed }}% | `{{ row.command }}` | {{ row.edit_format }} |
{% endfor %}
## Setting Ollama's context window size
Ollama uses a 2k context window by default, which is very small for working with aider. Unlike most other LLM servers, Ollama does not throw an error if you submit a request that exceeds the context window. Instead, it just silently truncates the request by discarding the "oldest" messages in the chat to make it fit within the context window.
All of the Ollama results above were collected with at least an 8k context window, which is large enough to attempt all the coding problems in the benchmark.
You can set the Ollama server's context window with a `.aider.model.settings.yml` file like this:
```yaml
- name: aider/extra_params
  extra_params:
    num_ctx: 8192
```
That uses the special model name `aider/extra_params` to set it for all models. You should probably use a specific model name instead, like:
```yaml
- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192
```
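Once that settings file is in place (aider looks for `.aider.model.settings.yml` in your home directory, the root of your git repo, or the directory you launch it from), you can run aider against the local Ollama server as usual. A minimal sketch, assuming Ollama is running on its default port:

```
export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama/qwen2.5-coder:32b-instruct-fp16
```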
## Choosing providers with OpenRouter
OpenRouter allows you to ignore specific providers in your preferences. This can be used to limit your OpenRouter requests to be served by only your preferred providers.
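If you'd rather control routing per-request than through your account preferences, OpenRouter's provider routing options can also be passed through the same `extra_params` mechanism shown above. A minimal sketch, assuming this OpenRouter model name and an illustrative provider list:

```yaml
- name: openrouter/qwen/qwen-2.5-coder-32b-instruct
  extra_params:
    extra_body:
      provider:
        # Only route to these providers, in this order,
        # and fail rather than fall back to others.
        order: ["Hyperbolic"]
        allow_fallbacks: false
```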
## Notes
This article went through many revisions as I received feedback from numerous members of the community. Here are some of the noteworthy learnings and changes:
- The first version of this article included incorrect Ollama models.
- Earlier Ollama results used the default 2k context window, which is too small and artificially harmed the benchmark results.
- The benchmark results appear to have uncovered a problem in the way OpenRouter was communicating with Hyperbolic. They fixed the issue on 11/24/24, shortly after it was pointed out.