
---
layout: post
title: Qwen3 benchmark results
excerpt: Benchmark results for Qwen3 models using the Aider polyglot coding benchmark.
highlight_image: /assets/2025-05-08-qwen3.jpg
date: 2025-05-08
---

# Qwen3 results on the aider polyglot benchmark

As previously discussed when Qwen2.5 was released, details matter when working with open source models for AI coding. Proprietary models are served by their creators or trusted providers with stable inference settings. Open source models are wonderful because anyone can serve them, but API providers can use very different inference settings, quantizations, etc.

Below is a collection of aider polyglot benchmark results for the new Qwen3 models. Results are presented using both "diff" and "whole" edit formats, with various model settings, against various API providers.

Details on the model settings used are provided after the results table.
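For background, each row is produced with aider's own benchmark harness from the aider repo. A run looks roughly like the sketch below; the run label here is just an illustrative placeholder, and the actual aider command for each row appears in the table's Command column.

```bash
# Run the polyglot benchmark for one model/edit-format combination.
# "qwen3-example-run" is a hypothetical label for the results directory.
./benchmark/benchmark.py qwen3-example-run \
  --model openrouter/qwen/qwen3-235b-a22b \
  --edit-format diff \
  --threads 10
```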

{: .note }
This article is being updated as new results become available.
Also, some results were submitted by aider users and have not been verified.


{% assign max_cost = 0 %}
{% for row in site.data.qwen3_leaderboard %}
  {% if row.total_cost > max_cost %}{% assign max_cost = row.total_cost %}{% endif %}
{% endfor %}
{% if max_cost == 0 %}{% assign max_cost = 1 %}{% endif %}
{% assign edit_sorted = site.data.qwen3_leaderboard | sort: 'pass_rate_2' | reverse %}

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Percent correct</th>
      <th>Cost</th>
      <th>Command</th>
      <th>Correct edit format</th>
      <th>Edit format</th>
    </tr>
  </thead>
  <tbody>
    {% for row in edit_sorted %}
    {% comment %} Loop index provides unique IDs for each row {% endcomment %}
    {% assign row_index = forloop.index0 %}
    {% assign rounded_cost = row.total_cost | times: 1.0 | round: 2 %}
    <tr id="result-row-{{ row_index }}">
      <td>{{ row.model }}</td>
      <td>{{ row.pass_rate_2 }}%</td>
      <td>
        {% if row.total_cost > 0 %}
        <div class="cost-bar" style="width: {{ row.total_cost | times: 100.0 | divided_by: max_cost }}%"></div>
        {% endif %}
        {% if row.total_cost == 0 or rounded_cost == 0.00 %}{% else %}${{ rounded_cost }}{% endif %}
      </td>
      <td><code>{{ row.command }}</code></td>
      <td>{{ row.percent_cases_well_formed }}%</td>
      <td>{{ row.edit_format }}</td>
    </tr>
    <tr id="details-row-{{ row_index }}">
      <td colspan="6">
        <ul>
          {% for pair in row %}
          {% if pair[1] != "" and pair[1] != nil %}
          <li>
            {% if pair[0] == 'percent_cases_well_formed' %}Percent cases well formed{% else %}{{ pair[0] | replace: '_', ' ' | capitalize }}{% endif %}:
            {% if pair[0] == 'command' %}<code>{{ pair[1] }}</code>{% else %}{{ pair[1] }}{% endif %}
          </li>
          {% endif %}
          {% endfor %}
        </ul>
      </td>
    </tr>
    {% endfor %}
  </tbody>
</table>

## No think, via official Alibaba API

These results were obtained by running against `https://dashscope.aliyuncs.com/compatible-mode/v1` with thinking disabled.

```bash
export OPENAI_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
export OPENAI_API_KEY=<key>
```

The model settings used in `.aider.model.settings.yml`:

```yaml
- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false
```
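With those settings in place, the runs would presumably have been launched the same way as in the sections below (a sketch; the exact command for this section was not shown):

```bash
aider --model openai/qwen3-235b-a22b
```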

## No think, via OpenRouter (Together)

These results were obtained with the recommended non-thinking model settings in `.aider.model.settings.yml`:

```yaml
- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
    extra_body:
      provider:
        order: ["Together"]
```
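The `extra_body.provider.order` entry is passed through to OpenRouter's provider routing, asking it to prefer Together for these runs so the results aren't mixed across providers with differing quantizations and inference settings.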

And then running aider:

```bash
aider --model openrouter/qwen/qwen3-235b-a22b
```

## OpenRouter, all providers, default settings (thinking)

These results were obtained by simply running aider as shown below, without any model-specific settings. This should have enabled thinking, assuming upstream API providers honor that convention for Qwen3.

```bash
aider --model openrouter/qwen/qwen3-xxx
```
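Since the table includes both "diff" and "whole" results, note that aider can be forced to use a particular edit format with its `--edit-format` flag, e.g.:

```bash
aider --model openrouter/qwen/qwen3-xxx --edit-format whole
```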

## No think, via OpenAI-compatible APIs

These benchmark results were obtained by GitHub user AlongWY with the recommended non-thinking model settings in `.aider.model.settings.yml`:

```yaml
- name: openai/<model-name>
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
```

And then running aider:

```bash
aider --model openai/<model-name> --openai-api-base <url>
```
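Here `<url>` points at any OpenAI-compatible server hosting a Qwen3 checkpoint. As one hypothetical example (not necessarily what AlongWY used), vLLM can expose such an endpoint:

```bash
# Serve a Qwen3 checkpoint behind an OpenAI-compatible API
# (the model and port here are illustrative assumptions).
vllm serve Qwen/Qwen3-32B --port 8000

# Point aider at the local server; the model name must match what the server hosts.
aider --model openai/Qwen/Qwen3-32B --openai-api-base http://localhost:8000/v1
```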