mirror of https://github.com/Aider-AI/aider.git
synced 2025-06-01 18:25:00 +00:00
This commit is contained in:
parent f26ccfa3e9
commit 0d983d504b

2 changed files with 66 additions and 14 deletions
@@ -120,4 +120,51 @@
   date: 2024-12-04
   versions: 0.66.1.dev
   seconds_per_case: 414.3
   total_cost: 0.0000
+
+- dirname: 2024-09-12-19-57-35--o1-mini-whole
+  test_cases: 133
+  model: o1-mini
+  edit_format: whole
+  commit_hash: 36fa773-dirty, 291b456
+  pass_rate_1: 49.6
+  pass_rate_2: 70.7
+  percent_cases_well_formed: 90.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 17
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 1
+  command: aider --model o1-mini
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 103.0
+  total_cost: 5.3725
+
+- dirname: 2024-09-21-16-45-11--o1-preview-flex-sr-markers
+  test_cases: 133
+  model: o1-preview
+  _released: 2024-09-12
+  edit_format: diff
+  commit_hash: 5493654-dirty
+  pass_rate_1: 57.9
+  pass_rate_2: 79.7
+  percent_cases_well_formed: 93.2
+  error_outputs: 11
+  num_malformed_responses: 11
+  num_with_malformed_responses: 9
+  user_asks: 3
+  lazy_comments: 0
+  syntax_errors: 10
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 1
+  command: aider --model o1-preview
+  date: 2024-09-21
+  versions: 0.56.1.dev
+  seconds_per_case: 80.9
+  total_cost: 63.9190
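The records added in the hunk above are plain YAML key/value entries, and the interesting comparisons fall out of simple arithmetic on them. A minimal sketch of that arithmetic, with the field names and values taken straight from the diff; the dicts stand in for the parsed YAML, and the `cost_per_case` helper is purely illustrative, not part of aider:

```python
# Two benchmark records from the diff above, transcribed as dicts.
o1_mini = {
    "model": "o1-mini",
    "test_cases": 133,
    "pass_rate_2": 70.7,
    "total_cost": 5.3725,
    "seconds_per_case": 103.0,
}
o1_preview = {
    "model": "o1-preview",
    "test_cases": 133,
    "pass_rate_2": 79.7,
    "total_cost": 63.9190,
    "seconds_per_case": 80.9,
}

def cost_per_case(record):
    """Average benchmark cost per test case, in dollars."""
    return record["total_cost"] / record["test_cases"]

for rec in (o1_mini, o1_preview):
    print(f'{rec["model"]}: ${cost_per_case(rec):.4f}/case, '
          f'{rec["pass_rate_2"]}% pass rate')
```

On these numbers, o1-preview's higher pass rate comes at roughly ten times o1-mini's cost per test case.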
@@ -16,21 +16,26 @@ nav_exclude: true
 
 QwQ 32B Preview is a "reasoning" model, which spends a lot of tokens thinking before
 rendering a final response.
-In this way, it is similar to OpenAI's o1 models which are best used by
-[pairing the reasoning model as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
+This is similar to OpenAI's o1 models, which are most effective with aider
+[when paired as an architect with a traditional LLM as an editor](https://aider.chat/2024/09/26/architect.html).
+In this mode, the reasoning model acts as an "architect" to propose a solution to the
+coding problem without regard for how to actually make edits to the source files.
+The "editor" model receives that proposal, and focuses solely on how to
+edit the existing source code to implement it.
 
-Used alone, QwQ was unable to comply with even the simplest editing format.
-So it was not very successful at editing source code files.
-QwQ's solo score on the benchmark was underwhelming,
-far worse than the o1 models performing solo.
+Used alone without being paired with an editor,
+QwQ was unable to comply with even the simplest editing format.
+It was not able to reliably edit source code files.
+As a result, QwQ's solo score on the benchmark was quite underwhelming
+(and far worse than the o1 models performing solo).
 
-QwQ can perform better than the
-Qwen 2.5 Coder 32B Instruct model that it is based on
-when they are paired as architect + editor.
-This provides only a modest benefit,
-but results in a fairly slow overall response time.
+QwQ is based on
+Qwen 2.5 Coder 32B Instruct,
+and does better when paired with it as an architect + editor combo.
+This provided only a modest benchmark improvement over just using Qwen alone,
+and comes with a fairly high cost in terms of latency.
 Each request must wait for QwQ to return all its thinking text
-and the ultimate solution.
+and the final solution proposal.
 And then one must wait for Qwen to turn that large
 response into actual file edits.
 
@@ -38,7 +43,7 @@ Pairing QwQ with other sensible editor models performed the same or worse than
 just using Qwen 2.5 Coder 32B Instruct alone.
 
 QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%.
-That is well off the
+That is well below the
 SOTA results for this benchmark: Sonnet alone scores 84%, and
 o1-preview + o1-mini as architect + editor scores 85%.
 
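The latency cost the revised post describes is just two strictly sequential model calls: the editor cannot start until the architect's entire output, thinking tokens included, has streamed back. A minimal sketch of that accounting; the token counts and throughput below are illustrative assumptions, not benchmark measurements:

```python
# Sketch: why a reasoning architect adds latency. Token counts and
# tokens/second are hypothetical stand-ins, not measured values.

def stage_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream one model's full output."""
    return output_tokens / tokens_per_second

# Architect (QwQ): thinking tokens plus the solution proposal must all
# stream back before the editor stage can begin.
architect = stage_seconds(output_tokens=6000 + 1000, tokens_per_second=50)

# Editor (Qwen): turns that large proposal into concrete file edits.
editor = stage_seconds(output_tokens=1500, tokens_per_second=50)

total = architect + editor  # the two stages run strictly in sequence
print(f"architect {architect:.0f}s + editor {editor:.0f}s = {total:.0f}s")
```

Under these assumed numbers the thinking tokens dominate: most of the wall-clock time is spent waiting on the architect before the editor can even start.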