This commit is contained in:
Paul Gauthier 2024-12-04 06:38:38 -08:00
parent f26ccfa3e9
commit 0d983d504b
2 changed files with 66 additions and 14 deletions


@@ -120,4 +120,51 @@
date: 2024-12-04
versions: 0.66.1.dev
seconds_per_case: 414.3
total_cost: 0.0000
- dirname: 2024-09-12-19-57-35--o1-mini-whole
test_cases: 133
model: o1-mini
edit_format: whole
commit_hash: 36fa773-dirty, 291b456
pass_rate_1: 49.6
pass_rate_2: 70.7
percent_cases_well_formed: 90.0
error_outputs: 0
num_malformed_responses: 0
num_with_malformed_responses: 0
user_asks: 17
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 1
command: aider --model o1-mini
date: 2024-09-12
versions: 0.56.1.dev
seconds_per_case: 103.0
total_cost: 5.3725
- dirname: 2024-09-21-16-45-11--o1-preview-flex-sr-markers
test_cases: 133
model: o1-preview
_released: 2024-09-12
edit_format: diff
commit_hash: 5493654-dirty
pass_rate_1: 57.9
pass_rate_2: 79.7
percent_cases_well_formed: 93.2
error_outputs: 11
num_malformed_responses: 11
num_with_malformed_responses: 9
user_asks: 3
lazy_comments: 0
syntax_errors: 10
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 1
command: aider --model o1-preview
date: 2024-09-21
versions: 0.56.1.dev
seconds_per_case: 80.9
total_cost: 63.9190


@@ -16,21 +16,26 @@ nav_exclude: true
QwQ 32B Preview is a "reasoning" model, which spends a lot of tokens thinking before
rendering a final response.
This is similar to OpenAI's o1 models, which are most effective with aider
[when paired as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
In this mode, the reasoning model acts as an "architect" to propose a solution to the
coding problem without regard for how to actually make edits to the source files.
The "editor" model receives that proposal, and focuses solely on how to
edit the existing source code to implement it.
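As a rough sketch, the architect + editor pairing described above is launched from the aider command line; the flag names and model routes below are assumptions based on aider's architect-mode docs, not commands taken from this post:

```shell
# Sketch: run aider in architect mode, with one model proposing solutions
# and a second model applying the actual file edits.
# --architect and --editor-model are assumed flags; model names are illustrative.
aider --architect \
      --model openrouter/qwen/qwq-32b-preview \
      --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct
```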
Used alone without being paired with an editor,
QwQ was unable to comply with even the simplest editing format.
It was not able to reliably edit source code files.
As a result, QwQ's solo score on the benchmark was quite underwhelming
(and far worse than the o1 models performing solo).
QwQ is based on
Qwen 2.5 Coder 32B Instruct,
and does better when paired with it as an architect + editor combo.
Though this provides only a modest benchmark improvement over just using Qwen alone,
it comes with a fairly high cost in terms of latency.
Each request must wait for QwQ to return all its thinking text
and the final solution proposal.
And then one must wait for Qwen to turn that large
response into actual file edits.
@@ -38,7 +43,7 @@ Pairing QwQ with other sensible editor models performed the same or worse than
just using Qwen 2.5 Coder 32B Instruct alone.
QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%.
That is well below the
SOTA results for this benchmark: Sonnet alone scores 84%, and
o1-preview + o1-mini as architect + editor scores 85%.