o1-mini diff results

Paul Gauthier 2024-09-12 15:38:40 -07:00
parent 1fbb5079d5
commit c00ac80909
3 changed files with 68 additions and 17 deletions

@@ -1089,7 +1089,7 @@
 - dirname: 2024-09-12-19-57-35--o1-mini-whole
   test_cases: 133
-  model: o1-mini
+  model: o1-mini (whole)
   edit_format: whole
   commit_hash: 36fa773-dirty, 291b456
   pass_rate_1: 49.6
@@ -1108,4 +1108,28 @@
   date: 2024-09-12
   versions: 0.56.1.dev
   seconds_per_case: 103.0
   total_cost: 5.3725
+- dirname: 2024-09-12-20-56-22--o1-mini-diff
+  test_cases: 133
+  model: o1-mini (diff)
+  edit_format: diff
+  commit_hash: 4598a37-dirty, 291b456, 752e823-dirty
+  pass_rate_1: 45.1
+  pass_rate_2: 62.4
+  percent_cases_well_formed: 85.7
+  error_outputs: 26
+  num_malformed_responses: 26
+  num_with_malformed_responses: 19
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 1
+  command: aider --model o1-mini --edit-format diff
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 177.7
+  total_cost: 11.1071

@@ -91,4 +91,28 @@
   date: 2024-09-12
   versions: 0.56.1.dev
   seconds_per_case: 103.0
   total_cost: 5.3725
+- dirname: 2024-09-12-20-56-22--o1-mini-diff
+  test_cases: 133
+  model: o1-mini (diff)
+  edit_format: diff
+  commit_hash: 4598a37-dirty, 291b456, 752e823-dirty
+  pass_rate_1: 45.1
+  pass_rate_2: 62.4
+  percent_cases_well_formed: 85.7
+  error_outputs: 26
+  num_malformed_responses: 26
+  num_with_malformed_responses: 19
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 1
+  command: aider --model o1-mini --edit-format diff
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 177.7
+  total_cost: 11.1071

@@ -10,23 +10,26 @@ nav_exclude: true
 # Benchmark results for OpenAI o1-mini
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
-but scored below those models
-when using the "whole" editing format.
-It was close enough to GPT-4o to be within the margin of error.
-The o1-mini model had trouble following the very simple whole editing format.
-It's possible that it would get a better score if aider prompted with
-more examples or was adapted to parse o1-mini's favorite way to mangle
-the response format.
+but scored below those models.
+It works best with the
+["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+where it returns a full copy of the source code file with changes.
+Other frontier models like GPT-4o and Sonnet are able to achieve
+high benchmark scores using the
+["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+This allows them to return search/replace blocks to
+efficiently edit the source code, saving time and token costs.
-Note that o1-mini's "whole" score is compared against GPT-4o and Sonnet
-"diff" results.
-Using diff is more challenging,
-but allows the model to return search/replace blocks to
-efficiently edit the source code.
-The whole format requires the o1-mini to return a fresh copy of the entire file,
-increasing costs and latency.
+The o1-mini model had trouble conforming to both the whole and diff edit formats.
+Aider is extremely permissive and tries hard to accept anything close
+to the correct formats.
+It's possible that o1-mini would get better scores if aider prompted with
+more examples or was adapted to parse o1-mini's favorite ways to mangle
+the response formats.
 Over time it may be possible to better harness o1-mini's capabilities through
 different prompting and editing formats.
 ## Using aider with o1-mini and o1-preview
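
As a rough cross-check of the figures recorded in this commit, the sketch below recomputes a few derived numbers from the two o1-mini entries. The helper names are hypothetical, and the formula for `percent_cases_well_formed` is an assumption (the commit does not show how the benchmark derives it); it merely happens to be consistent with the recorded 85.7 when computed from `test_cases` and `num_with_malformed_responses`.

```python
# Values copied from the o1-mini "whole" and "diff" entries in this commit.
runs = {
    "whole": {
        "test_cases": 133,
        "seconds_per_case": 103.0,
        "total_cost": 5.3725,
    },
    "diff": {
        "test_cases": 133,
        "num_with_malformed_responses": 19,
        "percent_cases_well_formed": 85.7,
        "seconds_per_case": 177.7,
        "total_cost": 11.1071,
    },
}


def percent_well_formed(test_cases: int, num_with_malformed: int) -> float:
    """Assumed formula: share of test cases that had no malformed response."""
    return round(100 * (test_cases - num_with_malformed) / test_cases, 1)


def cost_per_case(run: dict) -> float:
    """Average cost per test case, in the same units as total_cost."""
    return run["total_cost"] / run["test_cases"]


whole, diff = runs["whole"], runs["diff"]

# Compare the recomputed share with the value stored in the YAML entry.
recomputed = percent_well_formed(
    diff["test_cases"], diff["num_with_malformed_responses"]
)
print("well formed (recomputed):", recomputed)
print("well formed (recorded):  ", diff["percent_cases_well_formed"])

# Relative cost and latency of the diff run versus the whole run.
print("cost per case, whole:", round(cost_per_case(whole), 4))
print("cost per case, diff: ", round(cost_per_case(diff), 4))
print("cost ratio diff/whole:", round(diff["total_cost"] / whole["total_cost"], 2))
print(
    "time ratio diff/whole:",
    round(diff["seconds_per_case"] / whole["seconds_per_case"], 2),
)
```

For this particular pair of runs the diff entry records both a higher total cost and a higher `seconds_per_case` than the whole entry, so the ratios printed above come out greater than 1.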