mirror of
https://github.com/Aider-AI/aider.git
synced 2025-06-03 11:14:59 +00:00
parent
d747a3781d
commit
eba845ea51
3 changed files with 152 additions and 20 deletions
|
@ -1132,4 +1132,49 @@
|
||||||
versions: 0.56.1.dev
|
versions: 0.56.1.dev
|
||||||
seconds_per_case: 177.7
|
seconds_per_case: 177.7
|
||||||
total_cost: 11.1071
|
total_cost: 11.1071
|
||||||
|
|
||||||
|
- dirname: 2024-09-12-22-44-14--o1-preview-diff
|
||||||
|
test_cases: 133
|
||||||
|
model: o1-preview (diff)
|
||||||
|
edit_format: diff
|
||||||
|
commit_hash: 72f52bd
|
||||||
|
pass_rate_1: 56.4
|
||||||
|
pass_rate_2: 75.2
|
||||||
|
percent_cases_well_formed: 84.2
|
||||||
|
error_outputs: 27
|
||||||
|
num_malformed_responses: 27
|
||||||
|
num_with_malformed_responses: 21
|
||||||
|
user_asks: 8
|
||||||
|
lazy_comments: 0
|
||||||
|
syntax_errors: 7
|
||||||
|
indentation_errors: 3
|
||||||
|
exhausted_context_windows: 0
|
||||||
|
test_timeouts: 3
|
||||||
|
command: aider --model o1-preview
|
||||||
|
date: 2024-09-12
|
||||||
|
versions: 0.56.1.dev
|
||||||
|
seconds_per_case: 95.8
|
||||||
|
total_cost: 71.7927
|
||||||
|
|
||||||
|
- dirname: 2024-09-13-02-13-59--o1-preview-whole
|
||||||
|
test_cases: 133
|
||||||
|
model: o1-preview (whole)
|
||||||
|
edit_format: whole
|
||||||
|
commit_hash: 72f52bd-dirty
|
||||||
|
pass_rate_1: 58.6
|
||||||
|
pass_rate_2: 79.7
|
||||||
|
percent_cases_well_formed: 100.0
|
||||||
|
error_outputs: 0
|
||||||
|
num_malformed_responses: 0
|
||||||
|
num_with_malformed_responses: 0
|
||||||
|
user_asks: 2
|
||||||
|
lazy_comments: 0
|
||||||
|
syntax_errors: 1
|
||||||
|
indentation_errors: 0
|
||||||
|
exhausted_context_windows: 0
|
||||||
|
test_timeouts: 2
|
||||||
|
command: aider --model o1-preview
|
||||||
|
date: 2024-09-13
|
||||||
|
versions: 0.56.1.dev
|
||||||
|
seconds_per_case: 47.4
|
||||||
|
total_cost: 38.0612
|
|
@ -115,4 +115,72 @@
|
||||||
versions: 0.56.1.dev
|
versions: 0.56.1.dev
|
||||||
seconds_per_case: 177.7
|
seconds_per_case: 177.7
|
||||||
total_cost: 11.1071
|
total_cost: 11.1071
|
||||||
|
|
||||||
|
- dirname: 2024-09-05-21-26-49--sonnet-whole-sep5
|
||||||
|
test_cases: 133
|
||||||
|
model: claude-3.5-sonnet (whole)
|
||||||
|
edit_format: whole
|
||||||
|
commit_hash: 8cfdcbd
|
||||||
|
pass_rate_1: 55.6
|
||||||
|
pass_rate_2: 75.2
|
||||||
|
percent_cases_well_formed: 100.0
|
||||||
|
error_outputs: 0
|
||||||
|
num_malformed_responses: 0
|
||||||
|
num_with_malformed_responses: 0
|
||||||
|
user_asks: 0
|
||||||
|
lazy_comments: 0
|
||||||
|
syntax_errors: 0
|
||||||
|
indentation_errors: 0
|
||||||
|
exhausted_context_windows: 0
|
||||||
|
test_timeouts: 0
|
||||||
|
command: aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole
|
||||||
|
date: 2024-09-05
|
||||||
|
versions: 0.55.1.dev
|
||||||
|
seconds_per_case: 15.2
|
||||||
|
total_cost: 2.3502
|
||||||
|
|
||||||
|
- dirname: 2024-09-12-22-44-14--o1-preview-diff
|
||||||
|
test_cases: 133
|
||||||
|
model: o1-preview (diff)
|
||||||
|
edit_format: diff
|
||||||
|
commit_hash: 72f52bd
|
||||||
|
pass_rate_1: 56.4
|
||||||
|
pass_rate_2: 75.2
|
||||||
|
percent_cases_well_formed: 84.2
|
||||||
|
error_outputs: 27
|
||||||
|
num_malformed_responses: 27
|
||||||
|
num_with_malformed_responses: 21
|
||||||
|
user_asks: 8
|
||||||
|
lazy_comments: 0
|
||||||
|
syntax_errors: 7
|
||||||
|
indentation_errors: 3
|
||||||
|
exhausted_context_windows: 0
|
||||||
|
test_timeouts: 3
|
||||||
|
command: aider --model o1-preview
|
||||||
|
date: 2024-09-12
|
||||||
|
versions: 0.56.1.dev
|
||||||
|
seconds_per_case: 95.8
|
||||||
|
total_cost: 71.7927
|
||||||
|
|
||||||
|
- dirname: 2024-09-13-02-13-59--o1-preview-whole
|
||||||
|
test_cases: 133
|
||||||
|
model: o1-preview (whole)
|
||||||
|
edit_format: whole
|
||||||
|
commit_hash: 72f52bd-dirty
|
||||||
|
pass_rate_1: 58.6
|
||||||
|
pass_rate_2: 79.7
|
||||||
|
percent_cases_well_formed: 100.0
|
||||||
|
error_outputs: 0
|
||||||
|
num_malformed_responses: 0
|
||||||
|
num_with_malformed_responses: 0
|
||||||
|
user_asks: 2
|
||||||
|
lazy_comments: 0
|
||||||
|
syntax_errors: 1
|
||||||
|
indentation_errors: 0
|
||||||
|
exhausted_context_windows: 0
|
||||||
|
test_timeouts: 2
|
||||||
|
command: aider --model o1-preview
|
||||||
|
date: 2024-09-13
|
||||||
|
versions: 0.56.1.dev
|
||||||
|
seconds_per_case: 47.4
|
||||||
|
total_cost: 38.0612
|
|
@ -1,5 +1,5 @@
|
||||||
---
|
---
|
||||||
title: Benchmark results for OpenAI o1-mini
|
title: o1-preview is SOTA on the aider leaderboard
|
||||||
excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
|
excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
|
||||||
nav_exclude: true
|
nav_exclude: true
|
||||||
---
|
---
|
||||||
|
@ -7,7 +7,7 @@ nav_exclude: true
|
||||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
# Benchmark results for OpenAI o1-mini
|
# OpenAI o1-preview is SOTA on the aider leaderboard
|
||||||
|
|
||||||
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
|
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
|
||||||
|
|
||||||
|
@ -20,39 +20,58 @@ nav_exclude: true
|
||||||
%}
|
%}
|
||||||
|
|
||||||
|
|
||||||
|
## o1-preview
|
||||||
|
|
||||||
|
OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
|
||||||
|
a state of the art result.
|
||||||
|
It achieved this result with the
|
||||||
|
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
||||||
|
where the LLM returns a full copy of the source code file with changes.
|
||||||
|
|
||||||
|
It is much more practical to use aider's
|
||||||
|
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
||||||
|
which allows the LLM to return search/replace blocks to
|
||||||
|
efficiently edit the source code.
|
||||||
|
This saves significant time and token costs.
|
||||||
|
|
||||||
|
Using the diff edit format the o1-preview model had a strong
|
||||||
|
benchmark score of 75.2%.
|
||||||
|
This likely places o1-preview between Sonnet and GPT-4o for practical use,
|
||||||
|
but at significantly higher cost.
|
||||||
|
|
||||||
|
## o1-mini
|
||||||
|
|
||||||
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
|
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
|
||||||
but scored below those models.
|
but scored below those models.
|
||||||
|
It also works best with the whole edit format.
|
||||||
|
|
||||||
It works best with the
|
|
||||||
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
|
||||||
where it returns a full copy of the source code file with changes.
|
|
||||||
Other frontier models like GPT-4o and Sonnet are able to achieve
|
|
||||||
high benchmark scores using the
|
|
||||||
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
|
||||||
This allows them to return search/replace blocks to
|
|
||||||
efficiently edit the source code, saving time and token costs.
|
|
||||||
|
|
||||||
|
## Future work
|
||||||
|
|
||||||
|
The o1-preview model had trouble conforming to aider's diff edit format.
|
||||||
The o1-mini model had trouble conforming to both the whole and diff edit formats.
|
The o1-mini model had trouble conforming to both the whole and diff edit formats.
|
||||||
Aider is extremely permissive and tries hard to accept anything close
|
Aider is extremely permissive and tries hard to accept anything close
|
||||||
to the correct formats.
|
to the correct formats.
|
||||||
|
|
||||||
It's possible that o1-mini would get better scores if aider prompted with
|
It is surprising that such strong models had trouble with
|
||||||
more examples or was adapted to parse o1-mini's favorite ways to mangle
|
the syntactic requirements of simple text output formats.
|
||||||
the response formats.
|
It seems likely that aider could optimize its prompts and edit formats to
|
||||||
Over time it may be possible to better harness o1-mini's capabilities through
|
better harness the o1 models.
|
||||||
different prompting and editing formats.
|
|
||||||
|
|
||||||
## Using aider with o1-mini and o1-preview
|
|
||||||
|
## Using aider with o1
|
||||||
|
|
||||||
OpenAI's new o1 models are supported in the development version of aider:
|
OpenAI's new o1 models are supported in the development version of aider:
|
||||||
|
|
||||||
```
|
```
|
||||||
|
# To upgrade to the development version:
|
||||||
aider --install-main-branch
|
aider --install-main-branch
|
||||||
# or...
|
|
||||||
|
# Or, to upgrade/install:
|
||||||
python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
|
python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
|
||||||
|
|
||||||
|
# To launch aider with an o1 model:
|
||||||
aider --model o1-mini
|
aider --model o1-mini
|
||||||
|
|
||||||
aider --model o1-preview
|
aider --model o1-preview
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|