mirror of
https://github.com/Aider-AI/aider.git
synced 2025-05-31 09:44:59 +00:00
copy
parent
f26ccfa3e9
commit
0d983d504b
2 changed files with 66 additions and 14 deletions
@@ -120,4 +120,51 @@
  date: 2024-12-04
  versions: 0.66.1.dev
  seconds_per_case: 414.3
  total_cost: 0.0000
  total_cost: 0.0000

- dirname: 2024-09-12-19-57-35--o1-mini-whole
  test_cases: 133
  model: o1-mini
  edit_format: whole
  commit_hash: 36fa773-dirty, 291b456
  pass_rate_1: 49.6
  pass_rate_2: 70.7
  percent_cases_well_formed: 90.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 17
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  command: aider --model o1-mini
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 103.0
  total_cost: 5.3725

- dirname: 2024-09-21-16-45-11--o1-preview-flex-sr-markers
  test_cases: 133
  model: o1-preview
  _released: 2024-09-12
  edit_format: diff
  commit_hash: 5493654-dirty
  pass_rate_1: 57.9
  pass_rate_2: 79.7
  percent_cases_well_formed: 93.2
  error_outputs: 11
  num_malformed_responses: 11
  num_with_malformed_responses: 9
  user_asks: 3
  lazy_comments: 0
  syntax_errors: 10
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  command: aider --model o1-preview
  date: 2024-09-21
  versions: 0.56.1.dev
  seconds_per_case: 80.9
  total_cost: 63.9190

@@ -16,21 +16,26 @@ nav_exclude: true

QwQ 32B Preview is a "reasoning" model, which spends a lot of tokens thinking before
rendering a final response.
In this way, it is similar to OpenAI's o1 models which are best used by
[pairing the reasoning model as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
This is similar to OpenAI's o1 models, which are most effective with aider
[when paired as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
In this mode, the reasoning model acts as an "architect" to propose a solution to the
coding problem without regard for how to actually make edits to the source files.
The "editor" model receives that proposal, and focuses solely on how to
edit the existing source code to implement it.

Used alone, QwQ was unable to comply with even the simplest editing format.
So it was not very successful at editing source code files.
QwQ's solo score on the benchmark was underwhelming,
far worse than the o1 models performing solo.
Used alone without being paired with an editor,
QwQ was unable to comply with even the simplest editing format.
It was not able to reliably edit source code files.
As a result, QwQ's solo score on the benchmark was quite underwhelming
(and far worse than the o1 models performing solo).

QwQ can perform better than the
Qwen 2.5 Coder 32B Instruct model that it is based on
when they are paired as architect + editor.
This provides only a modest benefit,
but results in a fairly slow overall response time.
QwQ is based on
Qwen 2.5 Coder 32B Instruct,
and does better when paired with it as an architect + editor combo.
Though this provided only a modest benchmark improvement over just using Qwen alone,
and comes with a fairly high cost in terms of latency.
Each request must wait for QwQ to return all its thinking text
and the ultimate solution.
and the final solution proposal.
And then one must wait for Qwen to turn that large
response into actual file edits.

@@ -38,7 +43,7 @@ Pairing QwQ with other sensible editor models performed the same or worse than
just using Qwen 2.5 Coder 32B Instruct alone.

QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%.
That is well off the
That is well below the
SOTA results for this benchmark: Sonnet alone scores 84%, and
o1-preview + o1-mini as architect + editor scores 85%.
