2025-06-13 08:05:01 +00:00 · 2024-09-12 20:40:12 -07:00 · 2024-09-12 20:40:12 -07:00 · eba845ea51
commit eba845ea51
parent d747a3781d
3 changed files with 152 additions and 20 deletions
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@@ -1,5 +1,5 @@
 ---
-title: Benchmark results for OpenAI o1-mini
+title: o1-preview is SOTA on the aider leaderboard
 excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
 nav_exclude: true
 ---
@@ -7,7 +7,7 @@ nav_exclude: true
 <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
 {% endif %}

-# Benchmark results for OpenAI o1-mini
+# OpenAI o1-preview is SOTA on the aider leaderboard

 <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

@@ -20,39 +20,58 @@ nav_exclude: true
 %}


+## o1-preview
+
+OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
+a state of the art result.
+It achieved this result with the 
+["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+where the LLM returns a full copy of the source code file with changes.
+
+It is much more practical to use aider's
+["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+which allows the LLM to return search/replace blocks to 
+efficiently edit the source code.
+This saves significant time and token costs.
+
+Using the diff edit format the o1-preview model had a strong
+benchmark score of 75.2%.
+This likely places o1-preview between Sonnet and GPT-4o for practical use,
+but at significantly higher cost.
+
+## o1-mini
+
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
 but scored below those models.
+It also works best with the whole edit format.

-It works best with the 
-["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-where it returns a full copy of the source code file with changes.
-Other frontier models like GPT-4o and Sonnet are able to achieve
-high benchmark scores using the 
-["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-This allows them to return search/replace blocks to 
-efficiently edit the source code, saving time and token costs.

+## Future work
+
+The o1-preview model had trouble conforming to aider's diff edit format.
 The o1-mini model had trouble conforming to both the whole and diff edit formats.
 Aider is extremely permissive and tries hard to accept anything close
 to the correct formats.

-It's possible that o1-mini would get better scores if aider prompted with
-more examples or was adapted to parse o1-mini's favorite ways to mangle
-the response formats.
-Over time it may be possible to better harness o1-mini's capabilities through
-different prompting and editing formats.
+It is surprising that such strong models had trouble with
+the syntactic requirements of simple text output formats.
+It seems likely that aider could optimize its prompts and edit formats to
+better harness the o1 models.

-## Using aider with o1-mini and o1-preview
+
+## Using aider with o1

 OpenAI's new o1 models are supported in the development version of aider:

 ```
+# To upgrade to the development version:
 aider --install-main-branch
-# or...
+
+# Or, to upgrade/install:
 python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git

+# To launch aider with an o1 model:
 aider --model o1-mini
-
 aider --model o1-preview
 ```