---
layout: post
title: Qwen3 benchmark results
excerpt: Benchmark results for Qwen3 models using the Aider polyglot coding benchmark.
highlight_image: /assets/2025-05-08-qwen3.jpg
date: 2025-05-08
---
# Qwen3 results on the aider polyglot benchmark
As previously discussed when Qwen2.5 was released, details matter when working with open source models for AI coding. Proprietary models are served by their creators or trusted providers with stable inference settings. Open source models are wonderful because anyone can serve them, but API providers can use very different inference settings, quantizations, etc.
Below is a collection of aider polyglot benchmark results for the new Qwen3 models. Results are presented using both "diff" and "whole" edit formats, with various model settings, against various API providers.
See details on the model settings used after the results table.
{: .note }
This article is being updated as new results become available. Also, some results were submitted by aider users and have not been verified.
Qwen3 results on the aider polyglot benchmark: a results table generated from the collected benchmark data, with columns Model, Percent correct, Cost, Command, Correct edit format, and Edit format. Costs are in US dollars, rounded to the nearest cent; runs with no recorded cost show no cost figure.
## No think, via official Alibaba API
These results were obtained by running against https://dashscope.aliyuncs.com/compatible-mode/v1 with no thinking:
```bash
export OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<key>
```
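As a quick sanity check before benchmarking, the endpoint and key can be exercised with a plain OpenAI-style chat completion request. This is an illustrative sketch rather than part of the original run notes: the model id is the settings entry below with aider's `openai/` prefix stripped, and the `enable_thinking` body field is assumed to be accepted by the compatible-mode endpoint in the same way aider passes it via `extra_body`.

```bash
# Illustrative request against the DashScope OpenAI-compatible endpoint.
# "enable_thinking": false mirrors the extra_body setting used for these runs.
curl "$OPENAI_API_BASE/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-235b-a22b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 64,
    "enable_thinking": false
  }'
```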
The model settings used in `.aider.model.settings.yml` were:

```yaml
- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false
```
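The aider invocation for these runs is not listed above; assuming the settings file is picked up from the working directory, it would presumably mirror the commands shown in the later sections:

```bash
# Assumed invocation; the model name must match the `name:` entry in the settings above.
aider --model openai/qwen3-235b-a22b
```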
## OpenRouter (TogetherAI only), recommended /no_think settings
These results were obtained with the recommended non-thinking model settings in `.aider.model.settings.yml`:
```yaml
- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
    extra_body:
      provider:
        order: ["Together"]
```
And then running aider:
```bash
aider --model openrouter/qwen/qwen3-235b-a22b
```
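The `extra_body.provider.order` entry is what pins OpenRouter's routing to TogetherAI. As a rough illustration (not taken from the original run notes), the same settings map onto a raw OpenRouter request roughly as follows; note that aider actually prepends `/no_think` to its own system prompt rather than sending it on its own.

```bash
# Illustrative raw request showing how the settings above land in the JSON payload.
# The "provider" block restricts routing to Together; sampling fields mirror extra_params.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-235b-a22b",
    "messages": [
      {"role": "system", "content": "/no_think You are a helpful coding assistant."},
      {"role": "user", "content": "Write hello world in Python."}
    ],
    "max_tokens": 24000,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "provider": {"order": ["Together"]}
  }'
```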
## OpenRouter, all providers, default settings (thinking)
These results were obtained by simply running aider as shown below, without any model-specific settings. This should have enabled thinking, assuming upstream API providers honor that convention for Qwen3.
```bash
aider --model openrouter/qwen/qwen3-xxx
```
## vLLM, bfloat16, recommended /no_think
These benchmark results were obtained by GitHub user AlongWY with the recommended non-thinking model settings in `.aider.model.settings.yml`:
```yaml
- name: openai/<model-name>
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
```
And then running aider:
```bash
aider --model openai/<model-name> --openai-api-base <url>
```
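For context, `<url>` here is the vLLM server's OpenAI-compatible endpoint. Below is a sketch of how such a server might be launched in bfloat16; the flags shown are an assumption, vary by vLLM version, and are not necessarily the contributor's exact command.

```bash
# Illustrative vLLM launch; <model-name> stands for a Hugging Face Qwen3 checkpoint id.
vllm serve <model-name> --dtype bfloat16 --port 8000

# The server exposes an OpenAI-compatible API, so <url> would then be something like:
#   http://localhost:8000/v1
```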