mirror of
https://github.com/Aider-AI/aider.git
synced 2025-06-02 02:34:59 +00:00
parent
d747a3781d
commit
eba845ea51
3 changed files with 152 additions and 20 deletions
|
@ -1132,4 +1132,49 @@
|
|||
versions: 0.56.1.dev
|
||||
seconds_per_case: 177.7
|
||||
total_cost: 11.1071
|
||||
|
||||
|
||||
- dirname: 2024-09-12-22-44-14--o1-preview-diff
|
||||
test_cases: 133
|
||||
model: o1-preview (diff)
|
||||
edit_format: diff
|
||||
commit_hash: 72f52bd
|
||||
pass_rate_1: 56.4
|
||||
pass_rate_2: 75.2
|
||||
percent_cases_well_formed: 84.2
|
||||
error_outputs: 27
|
||||
num_malformed_responses: 27
|
||||
num_with_malformed_responses: 21
|
||||
user_asks: 8
|
||||
lazy_comments: 0
|
||||
syntax_errors: 7
|
||||
indentation_errors: 3
|
||||
exhausted_context_windows: 0
|
||||
test_timeouts: 3
|
||||
command: aider --model o1-preview
|
||||
date: 2024-09-12
|
||||
versions: 0.56.1.dev
|
||||
seconds_per_case: 95.8
|
||||
total_cost: 71.7927
|
||||
|
||||
- dirname: 2024-09-13-02-13-59--o1-preview-whole
|
||||
test_cases: 133
|
||||
model: o1-preview (whole)
|
||||
edit_format: whole
|
||||
commit_hash: 72f52bd-dirty
|
||||
pass_rate_1: 58.6
|
||||
pass_rate_2: 79.7
|
||||
percent_cases_well_formed: 100.0
|
||||
error_outputs: 0
|
||||
num_malformed_responses: 0
|
||||
num_with_malformed_responses: 0
|
||||
user_asks: 2
|
||||
lazy_comments: 0
|
||||
syntax_errors: 1
|
||||
indentation_errors: 0
|
||||
exhausted_context_windows: 0
|
||||
test_timeouts: 2
|
||||
command: aider --model o1-preview
|
||||
date: 2024-09-13
|
||||
versions: 0.56.1.dev
|
||||
seconds_per_case: 47.4
|
||||
total_cost: 38.0612
|
|
@ -115,4 +115,72 @@
|
|||
versions: 0.56.1.dev
|
||||
seconds_per_case: 177.7
|
||||
total_cost: 11.1071
|
||||
|
||||
|
||||
- dirname: 2024-09-05-21-26-49--sonnet-whole-sep5
|
||||
test_cases: 133
|
||||
model: claude-3.5-sonnet (whole)
|
||||
edit_format: whole
|
||||
commit_hash: 8cfdcbd
|
||||
pass_rate_1: 55.6
|
||||
pass_rate_2: 75.2
|
||||
percent_cases_well_formed: 100.0
|
||||
error_outputs: 0
|
||||
num_malformed_responses: 0
|
||||
num_with_malformed_responses: 0
|
||||
user_asks: 0
|
||||
lazy_comments: 0
|
||||
syntax_errors: 0
|
||||
indentation_errors: 0
|
||||
exhausted_context_windows: 0
|
||||
test_timeouts: 0
|
||||
command: aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole
|
||||
date: 2024-09-05
|
||||
versions: 0.55.1.dev
|
||||
seconds_per_case: 15.2
|
||||
total_cost: 2.3502
|
||||
|
||||
- dirname: 2024-09-12-22-44-14--o1-preview-diff
|
||||
test_cases: 133
|
||||
model: o1-preview (diff)
|
||||
edit_format: diff
|
||||
commit_hash: 72f52bd
|
||||
pass_rate_1: 56.4
|
||||
pass_rate_2: 75.2
|
||||
percent_cases_well_formed: 84.2
|
||||
error_outputs: 27
|
||||
num_malformed_responses: 27
|
||||
num_with_malformed_responses: 21
|
||||
user_asks: 8
|
||||
lazy_comments: 0
|
||||
syntax_errors: 7
|
||||
indentation_errors: 3
|
||||
exhausted_context_windows: 0
|
||||
test_timeouts: 3
|
||||
command: aider --model o1-preview
|
||||
date: 2024-09-12
|
||||
versions: 0.56.1.dev
|
||||
seconds_per_case: 95.8
|
||||
total_cost: 71.7927
|
||||
|
||||
- dirname: 2024-09-13-02-13-59--o1-preview-whole
|
||||
test_cases: 133
|
||||
model: o1-preview (whole)
|
||||
edit_format: whole
|
||||
commit_hash: 72f52bd-dirty
|
||||
pass_rate_1: 58.6
|
||||
pass_rate_2: 79.7
|
||||
percent_cases_well_formed: 100.0
|
||||
error_outputs: 0
|
||||
num_malformed_responses: 0
|
||||
num_with_malformed_responses: 0
|
||||
user_asks: 2
|
||||
lazy_comments: 0
|
||||
syntax_errors: 1
|
||||
indentation_errors: 0
|
||||
exhausted_context_windows: 0
|
||||
test_timeouts: 2
|
||||
command: aider --model o1-preview
|
||||
date: 2024-09-13
|
||||
versions: 0.56.1.dev
|
||||
seconds_per_case: 47.4
|
||||
total_cost: 38.0612
|
|
@ -1,5 +1,5 @@
|
|||
---
|
||||
title: Benchmark results for OpenAI o1-mini
|
||||
title: o1-preview is SOTA on the aider leaderboard
|
||||
excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
|
||||
nav_exclude: true
|
||||
---
|
||||
|
@ -7,7 +7,7 @@ nav_exclude: true
|
|||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Benchmark results for OpenAI o1-mini
|
||||
# OpenAI o1-preview is SOTA on the aider leaderboard
|
||||
|
||||
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
|
||||
|
||||
|
@ -20,39 +20,58 @@ nav_exclude: true
|
|||
%}
|
||||
|
||||
|
||||
## o1-preview
|
||||
|
||||
OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
|
||||
a state of the art result.
|
||||
It achieved this result with the
|
||||
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
||||
where the LLM returns a full copy of the source code file with changes.
|
||||
|
||||
It is much more practical to use aider's
|
||||
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
||||
which allows the LLM to return search/replace blocks to
|
||||
efficiently edit the source code.
|
||||
This saves significant time and token costs.
|
||||
|
||||
Using the diff edit format the o1-preview model had a strong
|
||||
benchmark score of 75.2%.
|
||||
This likely places o1-preview between Sonnet and GPT-4o for practical use,
|
||||
but at significantly higher cost.
|
||||
|
||||
## o1-mini
|
||||
|
||||
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
|
||||
but scored below those models.
|
||||
It also works best with the whole edit format.
|
||||
|
||||
It works best with the
|
||||
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
||||
where it returns a full copy of the source code file with changes.
|
||||
Other frontier models like GPT-4o and Sonnet are able to achieve
|
||||
high benchmark scores using the
|
||||
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
||||
This allows them to return search/replace blocks to
|
||||
efficiently edit the source code, saving time and token costs.
|
||||
|
||||
## Future work
|
||||
|
||||
The o1-preview model had trouble conforming to aider's diff edit format.
|
||||
The o1-mini model had trouble conforming to both the whole and diff edit formats.
|
||||
Aider is extremely permissive and tries hard to accept anything close
|
||||
to the correct formats.
|
||||
|
||||
It's possible that o1-mini would get better scores if aider prompted with
|
||||
more examples or was adapted to parse o1-mini's favorite ways to mangle
|
||||
the response formats.
|
||||
Over time it may be possible to better harness o1-mini's capabilities through
|
||||
different prompting and editing formats.
|
||||
It is surprising that such strong models had trouble with
|
||||
the syntactic requirements of simple text output formats.
|
||||
It seems likely that aider could optimize its prompts and edit formats to
|
||||
better harness the o1 models.
|
||||
|
||||
## Using aider with o1-mini and o1-preview
|
||||
|
||||
## Using aider with o1
|
||||
|
||||
OpenAI's new o1 models are supported in the development version of aider:
|
||||
|
||||
```
|
||||
# To upgrade to the development version:
|
||||
aider --install-main-branch
|
||||
# or...
|
||||
|
||||
# Or, to upgrade/install:
|
||||
python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
|
||||
|
||||
# To launch aider with an o1 model:
|
||||
aider --model o1-mini
|
||||
|
||||
aider --model o1-preview
|
||||
```
|
||||
|
||||
|
|
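The post's remark that o1-preview comes "at significantly higher cost" can be checked against the `total_cost` and `test_cases` fields in the benchmark entries above. The following is a minimal sketch, not part of the commit: the `runs` dictionary and the `cost_per_case` helper are illustrative names, and the figures are copied by hand from the three entries shown in this diff.

```python
# Figures copied from the benchmark entries in this commit's YAML data.
# The dictionary layout and helper name are assumptions made for illustration.
runs = {
    "claude-3.5-sonnet (whole)": {
        "total_cost": 2.3502, "test_cases": 133, "pass_rate_2": 75.2,
    },
    "o1-preview (diff)": {
        "total_cost": 71.7927, "test_cases": 133, "pass_rate_2": 75.2,
    },
    "o1-preview (whole)": {
        "total_cost": 38.0612, "test_cases": 133, "pass_rate_2": 79.7,
    },
}


def cost_per_case(run):
    """Average dollars spent per benchmark test case."""
    return run["total_cost"] / run["test_cases"]


if __name__ == "__main__":
    for name, run in runs.items():
        print(
            f"{name}: ${cost_per_case(run):.4f} per case, "
            f"pass_rate_2 = {run['pass_rate_2']}%"
        )
```

Dividing by `test_cases` only normalizes the recorded totals. It says nothing about why a given run cost what it did, so the per-case numbers are best read as a rough comparison between these specific benchmark runs.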