diff --git a/aider/website/_data/qwq.yml b/aider/website/_data/qwq.yml new file mode 100644 index 000000000..a8c742312 --- /dev/null +++ b/aider/website/_data/qwq.yml @@ -0,0 +1,123 @@ + +- dirname: 2024-11-28-21-38-50--architect-qwq-haiku-whole + test_cases: 133 + model: QwQ + Haiku + edit_format: architect + commit_hash: e4a1d6f + editor_model: claude-3-5-haiku-20241022 + editor_edit_format: editor-whole + pass_rate_1: 54.1 + pass_rate_2: 71.4 + percent_cases_well_formed: 100.0 + error_outputs: 4 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 196 + lazy_comments: 4 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/qwen/qwq-32b-preview --editor-model claude-3-5-haiku-20241022 --edit-format editor-whole + date: 2024-11-28 + versions: 0.65.2.dev + seconds_per_case: 154.7 + total_cost: 1.4196 + +- dirname: 2024-11-28-19-24-35--architect-qwq-deepseek-whole + test_cases: 133 + model: QwQ + DeepSeek V2.5 + edit_format: architect + commit_hash: e4a1d6f + editor_model: deepseek/deepseek-chat + editor_edit_format: editor-whole + pass_rate_1: 55.6 + pass_rate_2: 67.7 + percent_cases_well_formed: 100.0 + error_outputs: 3 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 193 + lazy_comments: 2 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/qwen/qwq-32b-preview --editor-model deepseek/deepseek-chat --edit-format editor-whole + date: 2024-11-28 + versions: 0.65.2.dev + seconds_per_case: 170.3 + total_cost: 0.1558 + + +- dirname: 2024-11-09-11-09-15--Qwen2.5-Coder-32B-Instruct + test_cases: 133 + model: Qwen2.5 Coder 32B-I + released: 2024-11-12 + edit_format: diff + commit_hash: ec9982a + pass_rate_1: 59.4 + pass_rate_2: 71.4 + percent_cases_well_formed: 94.7 + error_outputs: 17 + num_malformed_responses: 17 + num_with_malformed_responses: 7 + user_asks: 1 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 3 + command: aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 (via GLHF) + date: 2024-11-09 + versions: 0.59.2.dev + seconds_per_case: 22.5 + total_cost: 0.0000 + +- dirname: 2024-12-04-00-10-39--architect-qwq-qwen + test_cases: 132 + model: QwQ + Qwen2.5 Coder 32B-I + edit_format: architect + commit_hash: 51c02da + editor_model: openrouter/qwen/qwen-2.5-coder-32b-instruct + editor_edit_format: editor-whole + pass_rate_1: 58.3 + pass_rate_2: 73.6 + percent_cases_well_formed: 100.0 + error_outputs: 3 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 186 + lazy_comments: 5 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole + date: 2024-12-04 + versions: 0.66.1.dev + seconds_per_case: 144.1 + total_cost: 0.1444 + +- dirname: 2024-12-04-00-42-05--qwq-alone-whole + test_cases: 133 + model: QwQ + edit_format: whole + commit_hash: 19004c0 + pass_rate_1: 33.1 + pass_rate_2: 42.1 + percent_cases_well_formed: 91.0 + error_outputs: 28 + num_malformed_responses: 12 + num_with_malformed_responses: 12 + user_asks: 119 + lazy_comments: 2 + syntax_errors: 22 + indentation_errors: 9 + exhausted_context_windows: 2 + test_timeouts: 1 + command: aider --model openrouter/qwen/qwq-32b-preview + date: 2024-12-04 + versions: 0.66.1.dev + seconds_per_case: 414.3 + total_cost: 0.0000 \ No newline at end of file diff --git a/aider/website/_includes/qwq-chart.js b/aider/website/_includes/qwq-chart.js index a6d1c7b2b..dbe707916 100644 --- a/aider/website/_includes/qwq-chart.js +++ b/aider/website/_includes/qwq-chart.js @@ -71,13 +71,6 @@ document.addEventListener('DOMContentLoaded', function () { ]; } } - }, - title: { - display: true, - text: 'Aider code editing benchmark results', - font: { - size: 16 - } } }, scales: { @@ -85,7 +78,7 @@ document.addEventListener('DOMContentLoaded', function () { beginAtZero: true, title: { display: true, - text: 'Percent completed correctly', + text: 'Aider code editing benchmark (%)', font: { size: 18 } diff --git a/aider/website/_posts/2024-12-03-qwq.md b/aider/website/_posts/2024-12-03-qwq.md new file mode 100644 index 000000000..e8a98403c --- /dev/null +++ b/aider/website/_posts/2024-12-03-qwq.md @@ -0,0 +1,133 @@ +--- +title: QwQ is a code architect, not an editor +excerpt: QwQ is reasoning model like o1, and needs to be used as an architect with another model as editor. +#highlight_image: /assets/qwqization.jpg +draft: false +nav_exclude: true +--- +{% if page.date %} +

{{ page.date | date: "%B %d, %Y" }}

+{% endif %} + +# QwQ is a code architect, not an editor +{: .no_toc } + + + +QwQ 32B Preview is a "reasoning" model, which spends a lot of tokens thinking before +rendering a final response. +In this way, it is similar to OpenAI's o1 models which are best used by +[pairing the reasoning model as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html). + +Used alone, QwQ was unable to comply with even the simplest editing format. +So it was not very successful at editing source code files. +QwQ's solo score on the benchmark was underwhelming, +far worse than the o1 models performing solo. + +QwQ can perform better than the +Qwen 2.5 Coder 32B Instruct model that it is based on +when they are paired as architect + editor. +This provides only a modest benefit, +but results in a fairly slow overall response time. +Each request must wait for QwQ to return all its thinking text +and the ultimate solution. +And then one must wait for Qwen to turn that large +response into actual file edits. + +Pairing QwQ with other sensible editor models performed worse than +just using Qwen 2.5 Coder 32B Instruct alone. + +QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%. +That is well off the +SOTA results for this benchmark: Sonnet alone scores 84%, and +o1-preview + o1-mini as architect + editor scores 85%. + + +## QwQ specific editing formats + +I spent some time experimenting with a variety of custom editing formats +for QwQ. +In particular, I tried to parse the QwQ response and discard the long +sections of "thinking" and retain only the "final" solution. +While I was able to successfully tease these sections apart, +it did not translate to any significant improvement in the benchmarking results. + + +## Results + + + + + + + + + + + + + + + + {% assign qwq_sorted = site.data.qwq | sort: 'pass_rate_2' | reverse %} + {% for row in qwq_sorted %} + + + + + + + + {% endfor %} + +
ModelPercent completed correctlyPercent using correct edit formatCommandEdit format
{{ row.model }}{{ row.pass_rate_2 }}%{{ row.percent_cases_well_formed }}%{{ row.command }}{{ row.edit_format }}
+ + + + + +## Open source model caveats + +As discussed in a recent blog post, +[details matter with open source models](https://aider.chat/2024/11/21/quantization.html). +For clarity, I benchmarked against OpenRouter's endpoints for +QwQ 32B Preview and Qwen 2.5 Coder 32B Instruct. +For the other models, I went direct to their provider's APIs. + +Having recently done extensive testing of OpenRouter's Qwen 2.5 Coder 32B Instruct, +I feel comfortable using it. I blocked the provider Mancer due to small +context window. + +For QwQ 32B Preview, I blocked Fireworks because of its small context window.