2025-06-03 11:14:59 +00:00 · 2024-09-12 20:40:12 -07:00 · 2024-09-12 20:40:12 -07:00 · eba845ea51
commit eba845ea51
parent d747a3781d
3 changed files with 152 additions and 20 deletions
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@@ -1132,4 +1132,49 @@
  versions: 0.56.1.dev
  seconds_per_case: 177.7
  total_cost: 11.1071
-  
+
 - dirname: 2024-09-12-22-44-14--o1-preview-diff
  test_cases: 133
  model: o1-preview (diff)
  edit_format: diff
  commit_hash: 72f52bd
  pass_rate_1: 56.4
  pass_rate_2: 75.2
  percent_cases_well_formed: 84.2
  error_outputs: 27
  num_malformed_responses: 27
  num_with_malformed_responses: 21
  user_asks: 8
  lazy_comments: 0
  syntax_errors: 7
  indentation_errors: 3
  exhausted_context_windows: 0
  test_timeouts: 3
  command: aider --model o1-preview
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 95.8
  total_cost: 71.7927
 - dirname: 2024-09-13-02-13-59--o1-preview-whole
  test_cases: 133
  model: o1-preview (whole)
  edit_format: whole
  commit_hash: 72f52bd-dirty
  pass_rate_1: 58.6
  pass_rate_2: 79.7
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 2
  lazy_comments: 0
  syntax_errors: 1
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 2
  command: aider --model o1-preview
  date: 2024-09-13
  versions: 0.56.1.dev
  seconds_per_case: 47.4
  total_cost: 38.0612
--- a/aider/website/_data/o1_results.yml
+++ b/aider/website/_data/o1_results.yml
@@ -115,4 +115,72 @@
  versions: 0.56.1.dev
  seconds_per_case: 177.7
  total_cost: 11.1071
-  
+  
 - dirname: 2024-09-05-21-26-49--sonnet-whole-sep5
  test_cases: 133
  model: claude-3.5-sonnet (whole)
  edit_format: whole
  commit_hash: 8cfdcbd
  pass_rate_1: 55.6
  pass_rate_2: 75.2
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 0
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  command: aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole
  date: 2024-09-05
  versions: 0.55.1.dev
  seconds_per_case: 15.2
  total_cost: 2.3502
 - dirname: 2024-09-12-22-44-14--o1-preview-diff
  test_cases: 133
  model: o1-preview (diff)
  edit_format: diff
  commit_hash: 72f52bd
  pass_rate_1: 56.4
  pass_rate_2: 75.2
  percent_cases_well_formed: 84.2
  error_outputs: 27
  num_malformed_responses: 27
  num_with_malformed_responses: 21
  user_asks: 8
  lazy_comments: 0
  syntax_errors: 7
  indentation_errors: 3
  exhausted_context_windows: 0
  test_timeouts: 3
  command: aider --model o1-preview
  date: 2024-09-12
  versions: 0.56.1.dev
  seconds_per_case: 95.8
  total_cost: 71.7927
 - dirname: 2024-09-13-02-13-59--o1-preview-whole
  test_cases: 133
  model: o1-preview (whole)
  edit_format: whole
  commit_hash: 72f52bd-dirty
  pass_rate_1: 58.6
  pass_rate_2: 79.7
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 2
  lazy_comments: 0
  syntax_errors: 1
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 2
  command: aider --model o1-preview
  date: 2024-09-13
  versions: 0.56.1.dev
  seconds_per_case: 47.4
  total_cost: 38.0612
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@@ -1,5 +1,5 @@
 ---
-title: Benchmark results for OpenAI o1-mini
+title: o1-preview is SOTA on the aider leaderboard
 excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
 nav_exclude: true
 ---
@@ -7,7 +7,7 @@ nav_exclude: true
 <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
 {% endif %}
-# Benchmark results for OpenAI o1-mini
+# OpenAI o1-preview is SOTA on the aider leaderboard
 <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
@@ -20,39 +20,58 @@ nav_exclude: true
 %}
 ## o1-preview
 OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
 a state of the art result.
 It achieved this result with the 
 ["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
 where the LLM returns a full copy of the source code file with changes.
 It is much more practical to use aider's
 ["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
 which allows the LLM to return search/replace blocks to 
 efficiently edit the source code.
 This saves significant time and token costs.
 Using the diff edit format the o1-preview model had a strong
 benchmark score of 75.2%.
 This likely places o1-preview between Sonnet and GPT-4o for practical use,
 but at significantly higher cost.
 ## o1-mini
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
 but scored below those models.
 It also works best with the whole edit format.
 It works best with the 
 ["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
 where it returns a full copy of the source code file with changes.
 Other frontier models like GPT-4o and Sonnet are able to achieve
 high benchmark scores using the
 ["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format).
 This allows them to return search/replace blocks to 
 efficiently edit the source code, saving time and token costs.
 ## Future work
 The o1-preview model had trouble conforming to aider's diff edit format.
 The o1-mini model had trouble conforming to both the whole and diff edit formats.
 Aider is extremely permissive and tries hard to accept anything close
 to the correct formats.
-It's possible that o1-mini would get better scores if aider prompted with
-more examples or was adapted to parse o1-mini's favorite ways to mangle
-the response formats.
-Over time it may be possible to better harness o1-mini's capabilities through
+It is surprising that such strong models had trouble with
+the syntactic requirements of simple text output formats.
+It seems likely that aider could optimize its prompts and edit formats to
+better harness the o1 models.
 different prompting and editing formats.
-## Using aider with o1-mini and o1-preview
+
 ## Using aider with o1
 OpenAI's new o1 models are supported in the development version of aider:
 ```
 # To upgrade to the development version:
 aider --install-main-branch
-# or...
+
 # Or, to upgrade/install:
 python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
 # To launch aider with an o1 model:
 aider --model o1-mini
 aider --model o1-preview
 ```
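
As a reading aid for the two o1-preview entries added to the data files in this commit, the following minimal sketch derives per-case cost, total run time, and the approximate number of passing cases from the recorded fields. The values are copied from the YAML entries in this diff; the `RESULTS` list and the `summarize` helper are hypothetical names used only for illustration and are not part of aider.

```python
# Values copied from the two o1-preview entries added in this commit.
# RESULTS and summarize() are illustrative names, not part of aider.
RESULTS = [
    {
        "model": "o1-preview (diff)",
        "test_cases": 133,
        "pass_rate_2": 75.2,
        "seconds_per_case": 95.8,
        "total_cost": 71.7927,
    },
    {
        "model": "o1-preview (whole)",
        "test_cases": 133,
        "pass_rate_2": 79.7,
        "seconds_per_case": 47.4,
        "total_cost": 38.0612,
    },
]


def summarize(entry):
    """Derive per-case cost, total minutes, and approximate passing cases."""
    cases = entry["test_cases"]
    return {
        "model": entry["model"],
        # Total benchmark cost spread evenly over the test cases.
        "cost_per_case": entry["total_cost"] / cases,
        # pass_rate_2 is a percentage, so convert it back to a case count.
        "passed_cases": round(entry["pass_rate_2"] / 100 * cases),
        # seconds_per_case is an average, so this is the total run time.
        "total_minutes": entry["seconds_per_case"] * cases / 60,
    }


if __name__ == "__main__":
    for entry in RESULTS:
        s = summarize(entry)
        print(
            f"{s['model']}: {s['passed_cases']} passing cases, "
            f"${s['cost_per_case']:.2f} per case, "
            f"{s['total_minutes']:.0f} minutes total"
        )
```

In these two recorded runs, the diff entry shows a higher `total_cost` and `seconds_per_case` than the whole entry. The data alone does not say why, though the 27 malformed responses logged for the diff run may be a factor.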