Paul Gauthier 2024-12-22 08:43:59 -05:00
parent 6d7e8beaaa
commit c895e99306
2 changed files with 45 additions and 15 deletions

@@ -2204,3 +2204,29 @@
versions: 0.69.2.dev
seconds_per_case: 29.9
total_cost: 0.0000
- dirname: 2024-12-21-22-06-01--polyglot-o1-mini-whole
test_cases: 225
model: o1-mini-2024-09-12
edit_format: whole
commit_hash: a755079-dirty
pass_rate_1: 8.9
pass_rate_2: 27.1
pass_num_1: 20
pass_num_2: 61
percent_cases_well_formed: 95.6
error_outputs: 15
num_malformed_responses: 14
num_with_malformed_responses: 10
user_asks: 37
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 5
total_tests: 225
command: aider --model o1-mini
date: 2024-12-21
versions: 0.69.2.dev
seconds_per_case: 34.3
total_cost: 17.6270
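As a sanity check on this entry, the derived percentages are consistent with the raw counts. Below is a minimal sketch of that arithmetic, assuming `pass_rate_N` is `pass_num_N` over `test_cases` and that `percent_cases_well_formed` counts the cases without malformed responses; these field semantics are an assumption, not taken from aider's benchmark code.

```python
# Sketch of the arithmetic behind the derived fields in the entry above.
# Field semantics are assumed, not taken from aider's benchmark code.
test_cases = 225
pass_num_1 = 20                     # cases passing on the first attempt
pass_num_2 = 61                     # cases passing by the second attempt
num_with_malformed_responses = 10   # cases with at least one malformed edit

pass_rate_1 = round(100 * pass_num_1 / test_cases, 1)   # 8.9
pass_rate_2 = round(100 * pass_num_2 / test_cases, 1)   # 27.1
percent_cases_well_formed = round(
    100 * (test_cases - num_with_malformed_responses) / test_cases, 1
)                                                        # 95.6

print(pass_rate_1, pass_rate_2, percent_cases_well_formed)
```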

@@ -1,5 +1,6 @@
---
excerpt: TBD
title: o1 tops new aider polyglot leaderboard
excerpt: o1 scores the top result on aider's new multi-language, more challenging coding benchmark.
highlight_image: /assets/polyglot.jpg
draft: false
nav_exclude: true
@@ -67,10 +68,9 @@ The main goal for a new benchmark
was to re-calibrate the scale so that
today's top coding LLMs
would occupy a wide range of scores between about 5% and 50%.
A 50% top score from today's best models
should leave lots of headroom for future LLMs.
And by spreading models across a wide 5-50% range, we
can more clearly compare relative performance.
This should leave headroom for future LLMs and
make it possible to
more clearly compare the relative performance of top models.
## Designing the polyglot benchmark
@@ -91,9 +91,6 @@ from 6 of the most popular programming languages:
- Rust
Exercism provides a total of 697 coding problems in those 6 languages.
Many of them are adaptations of the same conceptual problem,
just ported into the different languages.
A set of 7 of today's top coding models each attempted all 697 of
the Exercism problems:
@@ -105,9 +102,9 @@ the Exercism problems:
- Qwen 32B Coder Instruct
- GPT-4o Mini
Based on their results,
the 697 coding problems were sorted by how many
solutions were found to each problem:
Depending on the difficulty of the problems,
a different number of solutions were found by the collection of
7 models:
| Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
|--------|-----------|------------|
@@ -122,8 +119,8 @@ solutions were found to each problem:
In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
These are far too easy, and wouldn't be good choices for the new benchmark.
Instead, we need the hard problems like the
These problems are far too easy, and wouldn't be good choices for the new benchmark.
Instead, we need hard problems like the
66 that none of the 7 models were able to solve.
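As a rough sketch, this selection step amounts to counting, for each problem, how many of the 7 models solved it and keeping only the hardest problems. The per-model result sets and the cut-off threshold below are illustrative assumptions, not aider's actual data or selection script.

```python
# Illustrative sketch only: fake per-model results standing in for the real
# runs of the 7 models over all 697 Exercism problems.
solved_by_model = {
    "model_a": {"two-fer", "leap", "anagram"},
    "model_b": {"two-fer", "leap"},
    # ...the real tally covers 7 models
}
all_problems = {"two-fer", "leap", "anagram", "forth", "zipper"}

# For each problem, count how many models found a solution (0..7 in the real run).
solutions_found = {
    p: sum(p in solved for solved in solved_by_model.values())
    for p in all_problems
}

# Keep the hardest problems; this threshold is an assumption for illustration.
MAX_SOLUTIONS = 1
hard_set = sorted(p for p, n in solutions_found.items() if n <= MAX_SOLUTIONS)
print(hard_set)  # -> ['anagram', 'forth', 'zipper']
```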
The new benchmark uses
@@ -132,7 +129,7 @@ This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages:
| Language | Hard Set |
| Language | Problems |
|-------------|----------|
| C++ | 26 |
| Go | 39 |
@@ -152,6 +149,13 @@ Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.
## Benchmark problems
The 225 coding problems are available in the
[aider polyglot benchmark repo]()
on GitHub.
## Results