diff --git a/aider/website/_data/edit_leaderboard.yml b/aider/website/_data/edit_leaderboard.yml
index 95d07f2b0..ca602ed83 100644
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@@ -1132,4 +1132,49 @@
   versions: 0.56.1.dev
   seconds_per_case: 177.7
   total_cost: 11.1071
-  
\ No newline at end of file
+
+- dirname: 2024-09-12-22-44-14--o1-preview-diff
+  test_cases: 133
+  model: o1-preview (diff)
+  edit_format: diff
+  commit_hash: 72f52bd
+  pass_rate_1: 56.4
+  pass_rate_2: 75.2
+  percent_cases_well_formed: 84.2
+  error_outputs: 27
+  num_malformed_responses: 27
+  num_with_malformed_responses: 21
+  user_asks: 8
+  lazy_comments: 0
+  syntax_errors: 7
+  indentation_errors: 3
+  exhausted_context_windows: 0
+  test_timeouts: 3
+  command: aider --model o1-preview
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 95.8
+  total_cost: 71.7927
+
+- dirname: 2024-09-13-02-13-59--o1-preview-whole
+  test_cases: 133
+  model: o1-preview (whole)
+  edit_format: whole
+  commit_hash: 72f52bd-dirty
+  pass_rate_1: 58.6
+  pass_rate_2: 79.7
+  percent_cases_well_formed: 100.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 1
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 2
+  command: aider --model o1-preview
+  date: 2024-09-13
+  versions: 0.56.1.dev
+  seconds_per_case: 47.4
+  total_cost: 38.0612
\ No newline at end of file
diff --git a/aider/website/_data/o1_results.yml b/aider/website/_data/o1_results.yml
index 292e258a2..099355e55 100644
--- a/aider/website/_data/o1_results.yml
+++ b/aider/website/_data/o1_results.yml
@@ -115,4 +115,72 @@
   versions: 0.56.1.dev
   seconds_per_case: 177.7
   total_cost: 11.1071
-  
\ No newline at end of file
+
+- dirname: 2024-09-05-21-26-49--sonnet-whole-sep5
+  test_cases: 133
+  model: claude-3.5-sonnet (whole)
+  edit_format: whole
+  commit_hash: 8cfdcbd
+  pass_rate_1: 55.6
+  pass_rate_2: 75.2
+  percent_cases_well_formed: 100.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 0
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 0
+  command: aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole
+  date: 2024-09-05
+  versions: 0.55.1.dev
+  seconds_per_case: 15.2
+  total_cost: 2.3502
+
+- dirname: 2024-09-12-22-44-14--o1-preview-diff
+  test_cases: 133
+  model: o1-preview (diff)
+  edit_format: diff
+  commit_hash: 72f52bd
+  pass_rate_1: 56.4
+  pass_rate_2: 75.2
+  percent_cases_well_formed: 84.2
+  error_outputs: 27
+  num_malformed_responses: 27
+  num_with_malformed_responses: 21
+  user_asks: 8
+  lazy_comments: 0
+  syntax_errors: 7
+  indentation_errors: 3
+  exhausted_context_windows: 0
+  test_timeouts: 3
+  command: aider --model o1-preview
+  date: 2024-09-12
+  versions: 0.56.1.dev
+  seconds_per_case: 95.8
+  total_cost: 71.7927
+
+- dirname: 2024-09-13-02-13-59--o1-preview-whole
+  test_cases: 133
+  model: o1-preview (whole)
+  edit_format: whole
+  commit_hash: 72f52bd-dirty
+  pass_rate_1: 58.6
+  pass_rate_2: 79.7
+  percent_cases_well_formed: 100.0
+  error_outputs: 0
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 1
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 2
+  command: aider --model o1-preview
+  date: 2024-09-13
+  versions: 0.56.1.dev
+  seconds_per_case: 47.4
+  total_cost: 38.0612
\ No newline at end of file
diff --git a/aider/website/_posts/2024-09-12-o1.md b/aider/website/_posts/2024-09-12-o1.md
index 0b06fdee3..c5399cc49 100644
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@@ -1,5 +1,5 @@
 ---
-title: Benchmark results for OpenAI o1-mini
+title: o1-preview is SOTA on the aider leaderboard
 excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
 nav_exclude: true
 ---
@@ -7,7 +7,7 @@ nav_exclude: true

{{ page.date | date: "%B %d, %Y" }}

 {% endif %}
-# Benchmark results for OpenAI o1-mini
+# OpenAI o1-preview is SOTA on the aider leaderboard
@@ -20,39 +20,58 @@ nav_exclude: true
 %}
+## o1-preview
+
+OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
+a state-of-the-art result.
+It achieved this result with the
+["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+where the LLM returns a full copy of the source code file with changes.
+
+It is much more practical to use aider's
+["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
+which allows the LLM to return search/replace blocks to
+efficiently edit the source code.
+This saves significant time and token costs.
+
+Using the diff edit format the o1-preview model had a strong
+benchmark score of 75.2%.
+This likely places o1-preview between Sonnet and GPT-4o for practical use,
+but at significantly higher cost.
+
+## o1-mini
+
 OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet, but scored below those models.
+It also works best with the whole edit format.
-It works best with the
-["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-where it returns a full copy of the source code file with changes.
-Other frontier models like GPT-4o and Sonnet are able to achieve
-high benchmark scores using the
-["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
-This allows them to return search/replace blocks to
-efficiently edit the source code, saving time and token costs.
+## Future work
+
+The o1-preview model had trouble conforming to aider's diff edit format.
 The o1-mini model had trouble conforming to both the whole and diff edit formats.
 Aider is extremely permissive and tries hard to accept anything close to the correct formats.
-It's possible that o1-mini would get better scores if aider prompted with
-more examples or was adapted to parse o1-mini's favorite ways to mangle
-the response formats.
-Over time it may be possible to better harness o1-mini's capabilities through
-different prompting and editing formats.
+It is surprising that such strong models had trouble with
+the syntactic requirements of simple text output formats.
+It seems likely that aider could optimize its prompts and edit formats to
+better harness the o1 models.
-## Using aider with o1-mini and o1-preview
+
+## Using aider with o1
 OpenAI's new o1 models are supported in the development version of aider:
 ```
+# To upgrade to the development version:
 aider --install-main-branch
-# or...
+
+# Or, to upgrade/install:
 python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
+# To launch aider with an o1 model:
 aider --model o1-mini
-
 aider --model o1-preview
 ```
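
The post's claim that o1-preview comes "at significantly higher cost" can be checked against the YAML entries added in this diff. The following is a minimal sketch, not part of the commit: the `results` list and the `cost_per_case` helper are hypothetical names introduced here, and the figures are copied from the `pass_rate_2`, `total_cost`, and `test_cases` fields above. The Sonnet run used a different date, aider version, and provider (OpenRouter), so the comparison is only approximate.

```python
# Figures copied from the YAML entries in this diff; the structure and
# names below are illustrative assumptions, not part of aider.
results = [
    {"model": "claude-3.5-sonnet (whole)", "pass_rate_2": 75.2,
     "total_cost": 2.3502, "test_cases": 133},
    {"model": "o1-preview (diff)", "pass_rate_2": 75.2,
     "total_cost": 71.7927, "test_cases": 133},
    {"model": "o1-preview (whole)", "pass_rate_2": 79.7,
     "total_cost": 38.0612, "test_cases": 133},
]


def cost_per_case(entry):
    """Average dollar cost of one benchmark test case for a run."""
    return entry["total_cost"] / entry["test_cases"]


# Print each run's final pass rate next to its average cost per case.
for entry in results:
    print(
        f"{entry['model']}: {entry['pass_rate_2']}% pass, "
        f"${cost_per_case(entry):.3f} per case"
    )
```

On these numbers, the o1-preview diff run reaches the same final pass rate as the Sonnet whole run while costing roughly thirty times as much in total, which is consistent with the post's wording.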