Paul Gauthier 2024-12-22 08:43:59 -05:00
parent 6d7e8beaaa
commit c895e99306
2 changed files with 45 additions and 15 deletions

@@ -2203,4 +2203,30 @@
   date: 2024-12-18
   versions: 0.69.2.dev
   seconds_per_case: 29.9
   total_cost: 0.0000
+- dirname: 2024-12-21-22-06-01--polyglot-o1-mini-whole
+  test_cases: 225
+  model: o1-mini-2024-09-12
+  edit_format: whole
+  commit_hash: a755079-dirty
+  pass_rate_1: 8.9
+  pass_rate_2: 27.1
+  pass_num_1: 20
+  pass_num_2: 61
+  percent_cases_well_formed: 95.6
+  error_outputs: 15
+  num_malformed_responses: 14
+  num_with_malformed_responses: 10
+  user_asks: 37
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 5
+  total_tests: 225
+  command: aider --model o1-mini
+  date: 2024-12-21
+  versions: 0.69.2.dev
+  seconds_per_case: 34.3
+  total_cost: 17.6270
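Review note: the rate fields in the new o1-mini entry agree with its raw counts. The Python check below is a minimal sketch. It assumes (the diff does not state this) that `pass_rate_N` is `100 * pass_num_N / test_cases` and that `percent_cases_well_formed` is the share of cases without a malformed response, each rounded to one decimal place. The `percent` helper is hypothetical.

```python
# Values copied from the o1-mini entry added in this hunk.
entry = {
    "test_cases": 225,
    "pass_num_1": 20,
    "pass_num_2": 61,
    "num_with_malformed_responses": 10,
}


def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumed formulas; the benchmark harness may compute these differently.
pass_rate_1 = percent(entry["pass_num_1"], entry["test_cases"])
pass_rate_2 = percent(entry["pass_num_2"], entry["test_cases"])
well_formed = percent(
    entry["test_cases"] - entry["num_with_malformed_responses"],
    entry["test_cases"],
)

print(pass_rate_1, pass_rate_2, well_formed)  # 8.9 27.1 95.6
```

Under those assumptions the computed values match the committed `pass_rate_1`, `pass_rate_2`, and `percent_cases_well_formed` fields.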

@@ -1,5 +1,6 @@
 ---
-excerpt: TBD
+title: o1 tops new aider polyglot leaderboard
+excerpt: o1 scores the top result on aider's new multi-language, more challenging coding benchmark.
 highlight_image: /assets/polyglot.jpg
 draft: false
 nav_exclude: true
@@ -67,10 +68,9 @@ The main goal for a new benchmark
 was to re-calibrate the scale so that
 today's top coding LLMs
 would occupy a wide range of scores between about 5% and 50%.
-A 50% top score from today's best models
-should leave lots of headroom for future LLMs.
-And by spreading models across a wide 5-50% range, we
-can more clearly compare relative performance.
+This should leave headroom for future LLMs and
+make it possible to
+more clearly compare the relative performance of top models.
 ## Designing the polyglot benchmark
@@ -91,9 +91,6 @@ from 6 of the most popular programming languages:
 - Rust
 Exercism provides a total of 697 coding problems in those 6 languages.
-Although many of them are adaptations of the same conceptual problem,
-just ported into the different languages.
 A set of 7 of today's top coding models each attempted all 697 of
 the Exercism problems:
@@ -105,9 +102,9 @@ the Exercism problems:
 - Qwen 32B Coder Instruct
 - GPT-4o Mini
-Based on their results,
-the 697 coding problems were sorted by how many
-solutions were found to each problem:
+Depending on the difficulty of the problems,
+a different number of solutions were found by the collection of
+7 models:
 | Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
 |--------|-----------|------------|
@@ -122,8 +119,8 @@ solutions were found to each problem:
 In the table above, you can see that 258 of the problems were solved
 by all 7 LLMs.
-These are far too easy, and wouldn't be good choices for the new benchmark.
-Instead, we need the hard problems like the
+These problems are far too easy, and wouldn't be good choices for the new benchmark.
+Instead, we need hard problems like the
 66 that none of the 7 models were able to solve.
 The new benchmark uses
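Review note: the "solutions found" tally that this hunk describes can be sketched in Python. This is an illustrative sketch only, not the script used for the post. The function name `tally_by_solutions_found`, the `solved_by` mapping, and the toy data (3 models instead of 7) are all hypothetical.

```python
from collections import Counter


def tally_by_solutions_found(solved_by, num_models):
    """Return (solutions_found, num_problems, cumulative) rows.

    solved_by maps a problem name to the set of models that solved it.
    """
    counts = Counter(len(models) for models in solved_by.values())
    rows = []
    cumulative = 0
    for found in range(num_models + 1):
        cumulative += counts[found]  # Counter returns 0 for missing keys
        rows.append((found, counts[found], cumulative))
    return rows


# Hypothetical toy data: 4 problems attempted by 3 models.
solved_by = {
    "problem-a": set(),
    "problem-b": {"model-1"},
    "problem-c": {"model-1", "model-2", "model-3"},
    "problem-d": {"model-1", "model-2", "model-3"},
}

for row in tally_by_solutions_found(solved_by, num_models=3):
    print(row)
```

Each row mirrors one line of the post's table: the number of models that solved a problem, how many problems fall in that bucket, and the running total.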
@@ -132,7 +129,7 @@ This achieves a balance between hard and moderate problems,
 and provides a large but not excessive total pool of problems.
 It also represents a good diversity of coding languages:
-| Language | Hard Set |
+| Language | Problems |
 |-------------|----------|
 | C++ | 26 |
 | Go | 39 |
@@ -152,6 +149,13 @@ Given the incredible pace of recent advancements, it
 will be interesting to see
 how long it will take for this new benchmark to saturate.
+## Benchmark problems
+The 225 coding problems are available in the
+[aider polyglot benchmark repo]()
+on GitHub.
 ## Results