diff --git a/aider/website/_posts/2024-09-12-o1.md b/aider/website/_posts/2024-09-12-o1.md
index 96b58c097..0b06fdee3 100644
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@@ -9,6 +9,17 @@ nav_exclude: true
 
 # Benchmark results for OpenAI o1-mini
 
+
+{% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
+{% include leaderboard_graph.html
+  chart_id="editChart"
+  data=edit_sorted
+  row_prefix="edit-row"
+  pass_rate_key="pass_rate_2"
+%}
+
+
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
 but scored below those models.
 
@@ -24,10 +35,10 @@ efficiently edit the source code, saving time and token costs.
 
 The o1-mini model had trouble conforming to both the whole and diff edit formats.
 Aider is extremely permissive and tries hard to accept anything close
 to the correct formats.
+
 It's possible that o1-mini would get better scores if aider prompted with
 more examples or was adapted to parse o1-mini's favorite ways to mangle
 the response formats.
-
 Over time it may be possible to better harness o1-mini's capabilities
 through different prompting and editing formats.
@@ -49,6 +60,7 @@ aider --model o1-preview
 
 
 > These are *preliminiary* benchmark results, which will be updated as
 > additional benchmark runs complete and rate limits open up.
+
 
 
@@ -60,7 +72,6 @@ aider --model o1-preview
-    {% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
     {% for row in edit_sorted %}
@@ -73,14 +84,6 @@ aider --model o1-preview
       {{ row.model }}
-
-
-{% include leaderboard_graph.html
-  chart_id="editChart"
-  data=edit_sorted
-  row_prefix="edit-row"
-  pass_rate_key="pass_rate_2"
-%}