mirror of
https://github.com/Aider-AI/aider.git
synced 2025-05-29 08:44:59 +00:00
This commit is contained in:
parent
6d7e8beaaa
commit
c895e99306
2 changed files with 45 additions and 15 deletions
|
@@ -2203,4 +2203,30 @@
|
||||||
date: 2024-12-18
|
date: 2024-12-18
|
||||||
versions: 0.69.2.dev
|
versions: 0.69.2.dev
|
||||||
seconds_per_case: 29.9
|
seconds_per_case: 29.9
|
||||||
total_cost: 0.0000
|
total_cost: 0.0000
|
||||||
|
|
||||||
|
- dirname: 2024-12-21-22-06-01--polyglot-o1-mini-whole
|
||||||
|
test_cases: 225
|
||||||
|
model: o1-mini-2024-09-12
|
||||||
|
edit_format: whole
|
||||||
|
commit_hash: a755079-dirty
|
||||||
|
pass_rate_1: 8.9
|
||||||
|
pass_rate_2: 27.1
|
||||||
|
pass_num_1: 20
|
||||||
|
pass_num_2: 61
|
||||||
|
percent_cases_well_formed: 95.6
|
||||||
|
error_outputs: 15
|
||||||
|
num_malformed_responses: 14
|
||||||
|
num_with_malformed_responses: 10
|
||||||
|
user_asks: 37
|
||||||
|
lazy_comments: 0
|
||||||
|
syntax_errors: 0
|
||||||
|
indentation_errors: 0
|
||||||
|
exhausted_context_windows: 0
|
||||||
|
test_timeouts: 5
|
||||||
|
total_tests: 225
|
||||||
|
command: aider --model o1-mini
|
||||||
|
date: 2024-12-21
|
||||||
|
versions: 0.69.2.dev
|
||||||
|
seconds_per_case: 34.3
|
||||||
|
total_cost: 17.6270
|
|
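As a quick sanity check on the o1-mini entry added in the hunk above, the reported percentages can be recomputed from the raw counts. This is a minimal sketch, not aider's benchmark code: it assumes each pass rate is `pass_num / test_cases` rounded to one decimal, and that `percent_cases_well_formed` is the share of cases without a malformed response. The helper name `percent` is hypothetical.

```python
# Values copied from the o1-mini entry in the diff above.
test_cases = 225
pass_num_1 = 20
pass_num_2 = 61
num_with_malformed_responses = 10


def percent(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: the YAML fields are derived this way; the source does not say so.
pass_rate_1 = percent(pass_num_1, test_cases)
pass_rate_2 = percent(pass_num_2, test_cases)
well_formed = percent(test_cases - num_with_malformed_responses, test_cases)

print(pass_rate_1, pass_rate_2, well_formed)
```

Under those assumptions the recomputed values agree with the `pass_rate_1`, `pass_rate_2`, and `percent_cases_well_formed` fields in the entry.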
@@ -1,5 +1,6 @@
|
||||||
---
|
---
|
||||||
excerpt: TBD
|
title: o1 tops new aider polyglot leaderboard
|
||||||
|
excerpt: o1 scores the top result on aider's new multi-language, more challenging coding benchmark.
|
||||||
highlight_image: /assets/polyglot.jpg
|
highlight_image: /assets/polyglot.jpg
|
||||||
draft: false
|
draft: false
|
||||||
nav_exclude: true
|
nav_exclude: true
|
||||||
|
@@ -67,10 +68,9 @@ The main goal for a new benchmark
|
||||||
was to re-calibrate the scale so that
|
was to re-calibrate the scale so that
|
||||||
today's top coding LLMs
|
today's top coding LLMs
|
||||||
would occupy a wide range of scores between about 5% and 50%.
|
would occupy a wide range of scores between about 5% and 50%.
|
||||||
A 50% top score from today's best models
|
This should leave headroom for future LLMs and
|
||||||
should leave lots of headroom for future LLMs.
|
make it possible to
|
||||||
And by spreading models across a wide 5-50% range, we
|
more clearly compare the relative performance of top models.
|
||||||
can more clearly compare relative performance.
|
|
||||||
|
|
||||||
## Designing the polyglot benchmark
|
## Designing the polyglot benchmark
|
||||||
|
|
||||||
|
@@ -91,9 +91,6 @@ from 6 of the most popular programming languages:
|
||||||
- Rust
|
- Rust
|
||||||
|
|
||||||
Exercism provides a total of 697 coding problems in those 6 languages.
|
Exercism provides a total of 697 coding problems in those 6 languages.
|
||||||
Although many of them are adaptations of the same conceptual problem,
|
|
||||||
just ported into the different languages.
|
|
||||||
|
|
||||||
A set of 7 of today's top coding models each attempted all 697 of
|
A set of 7 of today's top coding models each attempted all 697 of
|
||||||
the Exercism problems:
|
the Exercism problems:
|
||||||
|
|
||||||
|
@@ -105,9 +102,9 @@ the Exercism problems:
|
||||||
- Qwen 32B Coder Instruct
|
- Qwen 32B Coder Instruct
|
||||||
- GPT-4o Mini
|
- GPT-4o Mini
|
||||||
|
|
||||||
Based on their results,
|
Depending on the difficulty of the problems,
|
||||||
the 697 coding problems were sorted by how many
|
a different number of solutions were found by the collection of
|
||||||
solutions were found to each problem:
|
7 models:
|
||||||
|
|
||||||
| Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
|
| Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
|
||||||
|--------|-----------|------------|
|
|--------|-----------|------------|
|
||||||
|
@@ -122,8 +119,8 @@ solutions were found to each problem:
|
||||||
|
|
||||||
In the table above, you can see that 258 of the problems were solved
|
In the table above, you can see that 258 of the problems were solved
|
||||||
by all 7 LLMs.
|
by all 7 LLMs.
|
||||||
These are far too easy, and wouldn't be good choices for the new benchmark.
|
These problems are far too easy, and wouldn't be good choices for the new benchmark.
|
||||||
Instead, we need the hard problems like the
|
Instead, we need hard problems like the
|
||||||
66 that none of the 7 models were able to solve.
|
66 that none of the 7 models were able to solve.
|
||||||
|
|
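The tally described above (how many of the 7 models solved each problem, with a running total) can be sketched as follows. This is an illustrative sketch only, not the script used for the post: the function names, the toy problem IDs, the model names, and the `max_solutions` cutoff are all hypothetical, and the toy data uses 3 models rather than 7.

```python
from collections import Counter


def solution_histogram(solved_by, num_models):
    """Build (solutions_found, number_of_problems, cumulative) rows.

    solved_by maps a problem ID to the set of models that solved it.
    """
    counts = Counter(len(models) for models in solved_by.values())
    rows = []
    cumulative = 0
    for found in range(num_models + 1):
        cumulative += counts.get(found, 0)
        rows.append((found, counts.get(found, 0), cumulative))
    return rows


def select_hard(solved_by, max_solutions):
    """Return problems solved by at most max_solutions models (hypothetical cutoff)."""
    return sorted(p for p, models in solved_by.items() if len(models) <= max_solutions)


# Toy data: 4 made-up problems attempted by 3 made-up models.
toy = {
    "go/two-fer": {"model-a", "model-b", "model-c"},
    "rust/forth": set(),
    "cpp/bowling": {"model-a"},
    "java/zipper": set(),
}

rows = solution_histogram(toy, num_models=3)
hard = select_hard(toy, max_solutions=1)
print(rows)
print(hard)
```

Problems solved by every model land in the last row and are the natural ones to drop, while a low `max_solutions` keeps only the problems that most models failed.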
||||||
The new benchmark uses
|
The new benchmark uses
|
||||||
|
@@ -132,7 +129,7 @@ This achieves a balance between hard and moderate problems,
|
||||||
and provides a large but not excessive total pool of problems.
|
and provides a large but not excessive total pool of problems.
|
||||||
It also represents a good diversity of coding languages:
|
It also represents a good diversity of coding languages:
|
||||||
|
|
||||||
| Language | Hard Set |
|
| Language | Problems |
|
||||||
|-------------|----------|
|
|-------------|----------|
|
||||||
| C++ | 26 |
|
| C++ | 26 |
|
||||||
| Go | 39 |
|
| Go | 39 |
|
||||||
|
@@ -152,6 +149,13 @@ Given the incredible pace of recent advancements, it
|
||||||
will be interesting to see
|
will be interesting to see
|
||||||
how long it will take for this new benchmark to saturate.
|
how long it will take for this new benchmark to saturate.
|
||||||
|
|
||||||
|
## Benchmark problems
|
||||||
|
|
||||||
|
The 225 coding problems are available in the
|
||||||
|
[aider polyglot benchmark repo]()
|
||||||
|
on GitHub.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Results
|
## Results
|
||||||
|
|
||||||
|
|