2025-06-02 02:34:59 +00:00 · 2024-09-12 20:40:12 -07:00 · 2024-09-12 20:40:12 -07:00 · eba845ea51
commit eba845ea51
parent d747a3781d
3 changed files with 152 additions and 20 deletions
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@@ -1132,4 +1132,49 @@
  versions: 0.56.1.dev
  seconds_per_case: 177.7
  total_cost: 11.1071
-  
+
+- dirname: 2024-09-12-22-44-14--o1-preview-diff
+  test_cases: 133
+  model: o1-preview (diff)
+  edit_format: diff
+  commit_hash: 72f52bd
+  pass_rate_1: 56.4
+  pass_rate_2: 75.2
+  percent_cases_well_formed: 84.2
+  error_outputs: 27
+  num_malformed_responses: 27
+  num_with_malformed_responses: 21
+  user_asks: 8
+  lazy_comments: 0
+  syntax_errors: 7
+  indentation_errors: 3
+  exhausted_context_windows: 0
+  test_timeouts: 3
+  command: aider --model o1-preview
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 95.8
+  total_cost: 71.7927
+
+- dirname: 2024-09-13-02-13-59--o1-preview-whole
+  test_cases: 133
+  model: o1-preview (whole)
+  edit_format: whole
+  commit_hash: 72f52bd-dirty
+  pass_rate_1: 58.6
+  pass_rate_2: 79.7
+  percent_cases_well_formed: 100.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 1
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 2
+  command: aider --model o1-preview
+  date: 2024-09-13
+  versions: 0.56.1.dev
+  seconds_per_case: 47.4
+  total_cost: 38.0612
--- a/aider/website/_data/o1_results.yml
+++ b/aider/website/_data/o1_results.yml
@@ -115,4 +115,72 @@
  versions: 0.56.1.dev
  seconds_per_case: 177.7
  total_cost: 11.1071
-  
+  
+- dirname: 2024-09-05-21-26-49--sonnet-whole-sep5
+  test_cases: 133
+  model: claude-3.5-sonnet (whole)
+  edit_format: whole
+  commit_hash: 8cfdcbd
+  pass_rate_1: 55.6
+  pass_rate_2: 75.2
+  percent_cases_well_formed: 100.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 0
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 0
+  command: aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole
+  date: 2024-09-05
+  versions: 0.55.1.dev
+  seconds_per_case: 15.2
+  total_cost: 2.3502
+  
+- dirname: 2024-09-12-22-44-14--o1-preview-diff
+  test_cases: 133
+  model: o1-preview (diff)
+  edit_format: diff
+  commit_hash: 72f52bd
+  pass_rate_1: 56.4
+  pass_rate_2: 75.2
+  percent_cases_well_formed: 84.2
+  error_outputs: 27
+  num_malformed_responses: 27
+  num_with_malformed_responses: 21
+  user_asks: 8
+  lazy_comments: 0
+  syntax_errors: 7
+  indentation_errors: 3
+  exhausted_context_windows: 0
+  test_timeouts: 3
+  command: aider --model o1-preview
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 95.8
+  total_cost: 71.7927
+
+- dirname: 2024-09-13-02-13-59--o1-preview-whole
+  test_cases: 133
+  model: o1-preview (whole)
+  edit_format: whole
+  commit_hash: 72f52bd-dirty
+  pass_rate_1: 58.6
+  pass_rate_2: 79.7
+  percent_cases_well_formed: 100.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 1
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 2
+  command: aider --model o1-preview
+  date: 2024-09-13
+  versions: 0.56.1.dev
+  seconds_per_case: 47.4
+  total_cost: 38.0612
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@@ -1,5 +1,5 @@
 ---
-title: Benchmark results for OpenAI o1-mini
+title: o1-preview is SOTA on the aider leaderboard
 excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
 nav_exclude: true
 ---
@@ -7,7 +7,7 @@ nav_exclude: true
 <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
 {% endif %}

-# Benchmark results for OpenAI o1-mini
+# OpenAI o1-preview is SOTA on the aider leaderboard

 <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

@@ -20,39 +20,58 @@ nav_exclude: true
 %}


+## o1-preview
+
+OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
+a state of the art result.
+It achieved this result with the 
+["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+where the LLM returns a full copy of the source code file with changes.
+
+It is much more practical to use aider's
+["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+which allows the LLM to return search/replace blocks to 
+efficiently edit the source code.
+This saves significant time and token costs.
+
+Using the diff edit format the o1-preview model had a strong
+benchmark score of 75.2%.
+This likely places o1-preview between Sonnet and GPT-4o for practical use,
+but at significantly higher cost.
+
+## o1-mini
+
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
 but scored below those models.
+It also works best with the whole edit format.

-It works best with the 
-["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-where it returns a full copy of the source code file with changes.
-Other frontier models like GPT-4o and Sonnet are able to achieve
-high benchmark scores using the 
-["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-This allows them to return search/replace blocks to 
-efficiently edit the source code, saving time and token costs.

+## Future work
+
+The o1-preview model had trouble conforming to aider's diff edit format.
 The o1-mini model had trouble conforming to both the whole and diff edit formats.
 Aider is extremely permissive and tries hard to accept anything close
 to the correct formats.

-It's possible that o1-mini would get better scores if aider prompted with
-more examples or was adapted to parse o1-mini's favorite ways to mangle
-the response formats.
-Over time it may be possible to better harness o1-mini's capabilities through
-different prompting and editing formats.
+It is surprising that such strong models had trouble with
+the syntactic requirements of simple text output formats.
+It seems likely that aider could optimize its prompts and edit formats to
+better harness the o1 models.

-## Using aider with o1-mini and o1-preview
+
+## Using aider with o1

 OpenAI's new o1 models are supported in the development version of aider:

 ```
+# To upgrade to the development version:
 aider --install-main-branch
-# or...
+
+# Or, to upgrade/install:
 python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git

+# To launch aider with an o1 model:
 aider --model o1-mini
-
 aider --model o1-preview
 ```
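
The two o1-preview entries added to the YAML files above can be compared per test case with a little arithmetic. The following is a minimal sketch, not part of the commit: the `runs` dictionary and the `cost_per_case` helper are hypothetical names introduced for illustration, and the numbers are copied by hand from the `test_cases`, `pass_rate_2`, `total_cost`, and `seconds_per_case` fields in the diff.

```python
# Figures copied from the two o1-preview entries added in this commit.
# The dictionary layout and helper name are illustrative assumptions,
# not part of aider's code base.
runs = {
    "o1-preview (diff)": {
        "test_cases": 133,
        "pass_rate_2": 75.2,
        "total_cost": 71.7927,
        "seconds_per_case": 95.8,
    },
    "o1-preview (whole)": {
        "test_cases": 133,
        "pass_rate_2": 79.7,
        "total_cost": 38.0612,
        "seconds_per_case": 47.4,
    },
}


def cost_per_case(run):
    """Average cost in dollars per benchmark test case for one run."""
    return run["total_cost"] / run["test_cases"]


for name, run in runs.items():
    print(
        f"{name}: pass_rate_2={run['pass_rate_2']}%, "
        f"${cost_per_case(run):.2f} per case, "
        f"{run['seconds_per_case']}s per case"
    )
```

In these two recorded runs, the diff-format run has both the higher `total_cost` and the higher `seconds_per_case`; the sketch only restates the recorded figures and does not explain the difference.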