This commit is contained in:
Paul Gauthier 2024-09-12 20:40:12 -07:00
parent d747a3781d
commit eba845ea51
3 changed files with 152 additions and 20 deletions

View file

@ -1132,4 +1132,49 @@
versions: 0.56.1.dev versions: 0.56.1.dev
seconds_per_case: 177.7 seconds_per_case: 177.7
total_cost: 11.1071 total_cost: 11.1071
- dirname: 2024-09-12-22-44-14--o1-preview-diff
test_cases: 133
model: o1-preview (diff)
edit_format: diff
commit_hash: 72f52bd
pass_rate_1: 56.4
pass_rate_2: 75.2
percent_cases_well_formed: 84.2
error_outputs: 27
num_malformed_responses: 27
num_with_malformed_responses: 21
user_asks: 8
lazy_comments: 0
syntax_errors: 7
indentation_errors: 3
exhausted_context_windows: 0
test_timeouts: 3
command: aider --model o1-preview
date: 2024-09-12
versions: 0.56.1.dev
seconds_per_case: 95.8
total_cost: 71.7927
- dirname: 2024-09-13-02-13-59--o1-preview-whole
test_cases: 133
model: o1-preview (whole)
edit_format: whole
commit_hash: 72f52bd-dirty
pass_rate_1: 58.6
pass_rate_2: 79.7
percent_cases_well_formed: 100.0
error_outputs: 0
num_malformed_responses: 0
num_with_malformed_responses: 0
user_asks: 2
lazy_comments: 0
syntax_errors: 1
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 2
command: aider --model o1-preview
date: 2024-09-13
versions: 0.56.1.dev
seconds_per_case: 47.4
total_cost: 38.0612

View file

@ -115,4 +115,72 @@
versions: 0.56.1.dev versions: 0.56.1.dev
seconds_per_case: 177.7 seconds_per_case: 177.7
total_cost: 11.1071 total_cost: 11.1071
- dirname: 2024-09-05-21-26-49--sonnet-whole-sep5
test_cases: 133
model: claude-3.5-sonnet (whole)
edit_format: whole
commit_hash: 8cfdcbd
pass_rate_1: 55.6
pass_rate_2: 75.2
percent_cases_well_formed: 100.0
error_outputs: 0
num_malformed_responses: 0
num_with_malformed_responses: 0
user_asks: 0
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 0
command: aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole
date: 2024-09-05
versions: 0.55.1.dev
seconds_per_case: 15.2
total_cost: 2.3502
- dirname: 2024-09-12-22-44-14--o1-preview-diff
test_cases: 133
model: o1-preview (diff)
edit_format: diff
commit_hash: 72f52bd
pass_rate_1: 56.4
pass_rate_2: 75.2
percent_cases_well_formed: 84.2
error_outputs: 27
num_malformed_responses: 27
num_with_malformed_responses: 21
user_asks: 8
lazy_comments: 0
syntax_errors: 7
indentation_errors: 3
exhausted_context_windows: 0
test_timeouts: 3
command: aider --model o1-preview
date: 2024-09-12
versions: 0.56.1.dev
seconds_per_case: 95.8
total_cost: 71.7927
- dirname: 2024-09-13-02-13-59--o1-preview-whole
test_cases: 133
model: o1-preview (whole)
edit_format: whole
commit_hash: 72f52bd-dirty
pass_rate_1: 58.6
pass_rate_2: 79.7
percent_cases_well_formed: 100.0
error_outputs: 0
num_malformed_responses: 0
num_with_malformed_responses: 0
user_asks: 2
lazy_comments: 0
syntax_errors: 1
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 2
command: aider --model o1-preview
date: 2024-09-13
versions: 0.56.1.dev
seconds_per_case: 47.4
total_cost: 38.0612

View file

@ -1,5 +1,5 @@
--- ---
title: Benchmark results for OpenAI o1-mini title: o1-preview is SOTA on the aider leaderboard
excerpt: Preliminary benchmark results for the new OpenAI o1-mini model. excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
nav_exclude: true nav_exclude: true
--- ---
@ -7,7 +7,7 @@ nav_exclude: true
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p> <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %} {% endif %}
# Benchmark results for OpenAI o1-mini # OpenAI o1-preview is SOTA on the aider leaderboard
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script> <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
@ -20,39 +20,58 @@ nav_exclude: true
%} %}
## o1-preview
OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
a state of the art result.
It achieved this result with the
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
where the LLM returns a full copy of the source code file with changes.
It is much more practical to use aider's
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format).
which allows the LLM to return search/replace blocks to
efficiently edit the source code.
This saves significant time and token costs.
Using the diff edit format the o1-preview model had a strong
benchmark score of 75.2%.
This likely places o1-preview between Sonnet and GPT-4o for practical use,
but at significantly higher cost.
## o1-mini
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet, OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
but scored below those models. but scored below those models.
It also works best with the whole edit format.
It works best with the
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
where it returns a full copy of the source code file with changes.
Other frontier models like GPT-4o and Sonnet are able to achieve
high benchmark scores using the
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
This allows them to return search/replace blocks to
efficiently edit the source code, saving time and token costs.
## Future work
The o1-preview model had trouble conforming to aider's diff edit format.
The o1-mini model had trouble conforming to both the whole and diff edit formats. The o1-mini model had trouble conforming to both the whole and diff edit formats.
Aider is extremely permissive and tries hard to accept anything close Aider is extremely permissive and tries hard to accept anything close
to the correct formats. to the correct formats.
It's possible that o1-mini would get better scores if aider prompted with It is surprising that such strong models had trouble with
more examples or was adapted to parse o1-mini's favorite ways to mangle the syntactic requirements of simple text output formats.
the response formats. It seems likely that aider could optimize its prompts and edit formats to
Over time it may be possible to better harness o1-mini's capabilities through better harness the o1 models.
different prompting and editing formats.
## Using aider with o1-mini and o1-preview
## Using aider with o1
OpenAI's new o1 models are supported in the development version of aider: OpenAI's new o1 models are supported in the development version of aider:
``` ```
# To upgrade to the development version:
aider --install-main-branch aider --install-main-branch
# or...
# Or, to upgrade/install:
python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
# To launch aider with an o1 model:
aider --model o1-mini aider --model o1-mini
aider --model o1-preview aider --model o1-preview
``` ```