
---
layout: post
title: Qwen3 benchmark results
excerpt: Benchmark results for Qwen3 models using the Aider polyglot coding benchmark.
highlight_image: /assets/2025-05-08-qwen3.jpg
date: 2025-05-08
---

# Qwen3 results on the aider polyglot benchmark

As previously discussed when Qwen2.5 was released, details matter when working with open source models for AI coding. Proprietary models are served by their creators or trusted providers with stable inference settings. Open source models are wonderful because anyone can serve them, but API providers can use very different inference settings, quantizations, etc.

Below is a collection of aider polyglot benchmark results for the new Qwen3 models. Results are presented using both "diff" and "whole" edit formats, with various model settings, against various API providers.

Details on the model settings used are provided after the results table.
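For background, each row is produced with aider's own benchmark harness from the aider repo. A run looks roughly like the sketch below; the run label here is just an illustrative placeholder, and the actual aider command for each row appears in the table's Command column.

```bash
# Run the polyglot benchmark for one model/edit-format combination.
# "qwen3-example-run" is a hypothetical label for the results directory.
./benchmark/benchmark.py qwen3-example-run \
  --model openrouter/qwen/qwen3-235b-a22b \
  --edit-format diff \
  --threads 10
```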

{: .note }
This article is being updated as new results become available.
Also, some results were submitted by aider users and have not been verified.


{% assign max_cost = 0 %}
{% for row in site.data.qwen3_leaderboard %}
  {% if row.total_cost > max_cost %}{% assign max_cost = row.total_cost %}{% endif %}
{% endfor %}
{% if max_cost == 0 %}{% assign max_cost = 1 %}{% endif %}
{% assign edit_sorted = site.data.qwen3_leaderboard | sort: 'pass_rate_2' | reverse %}

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Percent correct</th>
      <th>Cost</th>
      <th>Command</th>
      <th>Correct edit format</th>
      <th>Edit format</th>
    </tr>
  </thead>
  <tbody>
    {% for row in edit_sorted %}
    {% comment %} Loop index provides unique IDs for each row {% endcomment %}
    {% assign row_index = forloop.index0 %}
    {% assign rounded_cost = row.total_cost | times: 1.0 | round: 2 %}
    <tr id="result-row-{{ row_index }}">
      <td>{{ row.model }}</td>
      <td>{{ row.pass_rate_2 }}%</td>
      <td>
        {% if row.total_cost > 0 %}
        <div class="cost-bar" style="width: {{ row.total_cost | times: 100.0 | divided_by: max_cost }}%"></div>
        {% endif %}
        {% if row.total_cost == 0 or rounded_cost == 0.00 %}{% else %}${{ rounded_cost }}{% endif %}
      </td>
      <td><code>{{ row.command }}</code></td>
      <td>{{ row.percent_cases_well_formed }}%</td>
      <td>{{ row.edit_format }}</td>
    </tr>
    <tr id="details-row-{{ row_index }}">
      <td colspan="6">
        <ul>
          {% for pair in row %}
          {% if pair[1] != "" and pair[1] != nil %}
          <li>
            {% if pair[0] == 'percent_cases_well_formed' %}Percent cases well formed{% else %}{{ pair[0] | replace: '_', ' ' | capitalize }}{% endif %}:
            {% if pair[0] == 'command' %}<code>{{ pair[1] }}</code>{% else %}{{ pair[1] }}{% endif %}
          </li>
          {% endif %}
          {% endfor %}
        </ul>
      </td>
    </tr>
    {% endfor %}
  </tbody>
</table>

## No think, via official Alibaba API

These results were obtained by running against `https://dashscope.aliyuncs.com/compatible-mode/v1` with thinking disabled.

```bash
export OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<key>
```

The model settings used in `.aider.model.settings.yml`:

```yaml
- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false
```
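With those settings in place, the runs would presumably have been launched the same way as in the sections below (a sketch; the exact command for this section was not shown):

```bash
aider --model openai/qwen3-235b-a22b
```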

## No think, via OpenRouter (Together)

These results were obtained with the recommended non-thinking model settings in `.aider.model.settings.yml`:

```yaml
- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
    extra_body:
      provider:
        order: ["Together"]
```
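The `extra_body.provider.order` entry is passed through to OpenRouter's provider routing, asking it to prefer Together for these runs so the results aren't mixed across providers with differing quantizations and inference settings.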

And then running aider:

```bash
aider --model openrouter/qwen/qwen3-235b-a22b
```

## OpenRouter, all providers, default settings (thinking)

These results were obtained by simply running aider as shown below, without any model-specific settings. This should have enabled thinking, assuming upstream API providers honor that convention for Qwen3.

```bash
aider --model openrouter/qwen/qwen3-xxx
```
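Since the table includes both "diff" and "whole" results, note that aider can be forced to use a particular edit format with its `--edit-format` flag, e.g.:

```bash
aider --model openrouter/qwen/qwen3-xxx --edit-format whole
```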

## No think, via OpenAI-compatible APIs

These benchmark results were obtained by GitHub user AlongWY with the recommended non-thinking model settings in `.aider.model.settings.yml`:

```yaml
- name: openai/<model-name>
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
```

And then running aider:

```bash
aider --model openai/<model-name> --openai-api-base <url>
```
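Here `<url>` points at any OpenAI-compatible server hosting a Qwen3 checkpoint. As one hypothetical example (not necessarily what AlongWY used), vLLM can expose such an endpoint:

```bash
# Serve a Qwen3 checkpoint behind an OpenAI-compatible API
# (the model and port here are illustrative assumptions).
vllm serve Qwen/Qwen3-32B --port 8000

# Point aider at the local server; the model name must match what the server hosts.
aider --model openai/Qwen/Qwen3-32B --openai-api-base http://localhost:8000/v1
```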