o1-mini diff results

2025-05-30 09:14:59 +00:00 · 2024-09-12 15:38:40 -07:00 · 2024-09-12 15:38:40 -07:00 · c00ac80909
commit c00ac80909
parent 1fbb5079d5
3 changed files with 68 additions and 17 deletions
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@ -1089,7 +1089,7 @@
 - dirname: 2024-09-12-19-57-35--o1-mini-whole
  test_cases: 133
-  model: o1-mini
+  model: o1-mini (whole)
  edit_format: whole
  commit_hash: 36fa773-dirty, 291b456
  pass_rate_1: 49.6
@ -1108,4 +1108,28 @@
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 103.0
-  total_cost: 5.3725  
+  total_cost: 5.3725
 - dirname: 2024-09-12-20-56-22--o1-mini-diff
  test_cases: 133
  model: o1-mini (diff)
  edit_format: diff
  commit_hash: 4598a37-dirty, 291b456, 752e823-dirty
  pass_rate_1: 45.1
  pass_rate_2: 62.4
  percent_cases_well_formed: 85.7
  error_outputs: 26
  num_malformed_responses: 26
  num_with_malformed_responses: 19
  user_asks: 2
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  command: aider --model o1-mini --edit-format diff
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 177.7
  total_cost: 11.1071
--- a/aider/website/_data/o1_results.yml
+++ b/aider/website/_data/o1_results.yml
@ -91,4 +91,28 @@
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 103.0
-  total_cost: 5.3725
+  total_cost: 5.3725
 - dirname: 2024-09-12-20-56-22--o1-mini-diff
  test_cases: 133
  model: o1-mini (diff)
  edit_format: diff
  commit_hash: 4598a37-dirty, 291b456, 752e823-dirty
  pass_rate_1: 45.1
  pass_rate_2: 62.4
  percent_cases_well_formed: 85.7
  error_outputs: 26
  num_malformed_responses: 26
  num_with_malformed_responses: 19
  user_asks: 2
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  command: aider --model o1-mini --edit-format diff
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 177.7
  total_cost: 11.1071
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@ -10,23 +10,26 @@ nav_exclude: true
 # Benchmark results for OpenAI o1-mini
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
-but scored below those models
+but scored below those models.
 when using the "whole" editing format.
 It was close enough to GPT-4o to be within the margin of error.
-The o1-mini model had trouble following the very simple whole editing format.
+It works best with the 
-It's possible that it would get a better score if aider prompted with
+["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-more examples or was adapted to parse o1-mini's favorite way to mangle
+where it returns a full copy of the source code file with changes.
-the response format.
+Other frontier models like GPT-4o and Sonnet are able to achieve
 high benchmark scores using the 
 ["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
 This allows them to return search/replace blocks to 
 efficiently edit the source code, saving time and token costs.
-Note that o1-mini's "whole" score is compared against GPT-4o and Sonnet 
+The o1-mini model had trouble conforming to both the whole and diff edit formats.
-"diff" results.
+Aider is extremely permissive and tries hard to accept anything close
-Using diff is more challenging,
+to the correct formats.
-but allows the model to return search/replace blocks to 
+It's possible that o1-mini would get better scores if aider prompted with
-efficiently edit the source code.
+more examples or was adapted to parse o1-mini's favorite ways to mangle
-The whole format requires the o1-mini to return a fresh copy of the entire file,
+the response formats.
 increasing costs and latency.
 Over time it may be possible to better harness o1-mini's capabilities through
 different prompting and editing formats.
 ## Using aider with o1-mini and o1-preview