From c895e99306b80b6372c98f7c8f37125f3ad1e7eb Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Sun, 22 Dec 2024 08:43:59 -0500
Subject: [PATCH] copy

---
 aider/website/_data/edit_leaderboard.yml    | 28 +++++++++++++++++-
 aider/website/_posts/2024-12-21-polyglot.md | 32 ++++++++++++---------
 2 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/aider/website/_data/edit_leaderboard.yml b/aider/website/_data/edit_leaderboard.yml
index c567300d1..08e333889 100644
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@@ -2203,4 +2203,30 @@
   date: 2024-12-18
   versions: 0.69.2.dev
   seconds_per_case: 29.9
-  total_cost: 0.0000
\ No newline at end of file
+  total_cost: 0.0000
+
+- dirname: 2024-12-21-22-06-01--polyglot-o1-mini-whole
+  test_cases: 225
+  model: o1-mini-2024-09-12
+  edit_format: whole
+  commit_hash: a755079-dirty
+  pass_rate_1: 8.9
+  pass_rate_2: 27.1
+  pass_num_1: 20
+  pass_num_2: 61
+  percent_cases_well_formed: 95.6
+  error_outputs: 15
+  num_malformed_responses: 14
+  num_with_malformed_responses: 10
+  user_asks: 37
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 5
+  total_tests: 225
+  command: aider --model o1-mini
+  date: 2024-12-21
+  versions: 0.69.2.dev
+  seconds_per_case: 34.3
+  total_cost: 17.6270
\ No newline at end of file
diff --git a/aider/website/_posts/2024-12-21-polyglot.md b/aider/website/_posts/2024-12-21-polyglot.md
index 639b8041d..8218631b9 100644
--- a/aider/website/_posts/2024-12-21-polyglot.md
+++ b/aider/website/_posts/2024-12-21-polyglot.md
@@ -1,5 +1,6 @@
 ---
-excerpt: TBD
+title: o1 tops new aider polyglot leaderboard
+excerpt: o1 scores the top result on aider's new multi-language, more challenging coding benchmark.
 highlight_image: /assets/polyglot.jpg
 draft: false
 nav_exclude: true
@@ -67,10 +68,9 @@ The main goal for a new benchmark
 was to re-calibrate the scale so that
 today's top coding LLMs
 would occupy a wide range of scores between about 5% and 50%.
-A 50% top score from today's best models
-should leave lots of headroom for future LLMs.
-And by spreading models across a wide 5-50% range, we
-can more clearly compare relative performance.
+This should leave headroom for future LLMs and
+make it possible to
+more clearly compare the relative performance of top models.
 
 ## Designing the polyglot benchmark
 
@@ -91,9 +91,6 @@ from 6 of the most popular programming languages:
 - Rust
 
 Exercism provides a total of 697 coding problems in those 6 languages.
-Although many of them are adaptations of the same conceptual problem,
-just ported into the different languages.
-
 A set of 7 of today's top coding models each attempted all 697 of
 the Exercism problems:
 
@@ -105,9 +102,9 @@ the Exercism problems:
 - Qwen 32B Coder Instruct
 - GPT-4o Mini
 
-Based on their results,
-the 697 coding problems were sorted by how many
-solutions were found to each problem:
+Depending on the difficulty of the problems,
+a different number of solutions were found by the collection of
+7 models:
 
 | Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
 |--------|-----------|------------|
@@ -122,8 +119,8 @@ solutions were found to each problem:
 
 In the table above, you can see that 258 of the problems were solved
 by all 7 LLMs.
-These are far too easy, and wouldn't be good choices for the new benchmark.
-Instead, we need the hard problems like the
+These problems are far too easy, and wouldn't be good choices for the new benchmark.
+Instead, we need hard problems like the
 66 that none of the 7 models were able to solve.
 
 The new benchmark uses
@@ -132,7 +129,7 @@ This achieves a balance between hard and moderate problems,
 and provides a large but not excessive total pool of problems.
 It also represents a good diversity of coding languages:
 
-| Language | Hard Set |
+| Language | Problems |
 |-------------|----------|
 | C++ | 26 |
 | Go | 39 |
@@ -152,6 +149,13 @@ Given the incredible pace of recent advancements, it
 will be interesting to see
 how long it will take for this new benchmark to saturate.
 
+## Benchmark problems
+
+The 225 coding problems are available in the
+[aider polyglot benchmark repo]()
+on GitHub.
+
+
 ## Results