diff --git a/aider/website/_data/qwq.yml b/aider/website/_data/qwq.yml
index a8c742312..5e80639e4 100644
--- a/aider/website/_data/qwq.yml
+++ b/aider/website/_data/qwq.yml
@@ -120,4 +120,51 @@
   date: 2024-12-04
   versions: 0.66.1.dev
   seconds_per_case: 414.3
-  total_cost: 0.0000
\ No newline at end of file
+  total_cost: 0.0000
+
+- dirname: 2024-09-12-19-57-35--o1-mini-whole
+  test_cases: 133
+  model: o1-mini
+  edit_format: whole
+  commit_hash: 36fa773-dirty, 291b456
+  pass_rate_1: 49.6
+  pass_rate_2: 70.7
+  percent_cases_well_formed: 90.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 17
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 1
+  command: aider --model o1-mini
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 103.0
+  total_cost: 5.3725
+
+- dirname: 2024-09-21-16-45-11--o1-preview-flex-sr-markers
+  test_cases: 133
+  model: o1-preview
+  _released: 2024-09-12
+  edit_format: diff
+  commit_hash: 5493654-dirty
+  pass_rate_1: 57.9
+  pass_rate_2: 79.7
+  percent_cases_well_formed: 93.2
+  error_outputs: 11
+  num_malformed_responses: 11
+  num_with_malformed_responses: 9
+  user_asks: 3
+  lazy_comments: 0
+  syntax_errors: 10
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 1
+  command: aider --model o1-preview
+  date: 2024-09-21
+  versions: 0.56.1.dev
+  seconds_per_case: 80.9
+  total_cost: 63.9190
diff --git a/aider/website/_posts/2024-12-03-qwq.md b/aider/website/_posts/2024-12-03-qwq.md
index 64cfd2bd6..523e599b2 100644
--- a/aider/website/_posts/2024-12-03-qwq.md
+++ b/aider/website/_posts/2024-12-03-qwq.md
@@ -16,21 +16,26 @@ nav_exclude: true
 QwQ 32B Preview is a "reasoning" model, which spends a lot of tokens thinking before
 rendering a final response.
-In this way, it is similar to OpenAI's o1 models which are best used by
-[pairing the reasoning model as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
+This is similar to OpenAI's o1 models, which are most effective with aider
+[when paired as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
+In this mode, the reasoning model acts as an "architect" to propose a solution to the
+coding problem without regard for how to actually make edits to the source files.
+The "editor" model receives that proposal, and focuses solely on how to
+edit the existing source code to implement it.
 
-Used alone, QwQ was unable to comply with even the simplest editing format.
-So it was not very successful at editing source code files.
-QwQ's solo score on the benchmark was underwhelming,
-far worse than the o1 models performing solo.
+Used alone, without being paired with an editor,
+QwQ was unable to comply with even the simplest editing format.
+It was not able to reliably edit source code files.
+As a result, QwQ's solo score on the benchmark was quite underwhelming
+(and far worse than the o1 models performing solo).
 
-QwQ can perform better than the
-Qwen 2.5 Coder 32B Instruct model that it is based on
-when they are paired as architect + editor.
-This provides only a modest benefit,
-but results in a fairly slow overall response time.
+QwQ is based on
+Qwen 2.5 Coder 32B Instruct,
+and does better when paired with it as an architect + editor combo.
+This provides only a modest benchmark improvement over just using Qwen alone,
+and comes with a fairly high cost in terms of latency.
 Each request must wait for QwQ to return all its thinking text
-and the ultimate solution.
+and the final solution proposal.
 And then one must wait for Qwen to turn that large response into actual file edits.
 
@@ -38,7 +43,7 @@
 Pairing QwQ with other sensible editor models performed
 the same or worse than just using Qwen 2.5 Coder 32B Instruct alone.
 
 QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%.
-That is well off the
+That is well below the
 SOTA results for this benchmark:
 Sonnet alone scores 84%, and o1-preview + o1-mini as architect + editor scores 85%.
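
For reference, the architect + editor pairing described in the post above is driven from aider's command line with the --architect and --editor-model flags. The lines below are a minimal sketch of how the QwQ + Qwen combo might be launched; the OpenRouter model names are assumptions for illustration and are not taken from this diff.

    # Sketch only: QwQ as architect, Qwen 2.5 Coder as editor
    # (OpenRouter model slugs are assumed, adjust to your provider)
    aider --architect \
          --model openrouter/qwen/qwq-32b-preview \
          --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct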