Paul Gauthier 2024-12-21 14:11:54 -08:00
parent ec44850646
commit 8b62d8a6c5
4 changed files with 380 additions and 0 deletions

@ -0,0 +1,203 @@
---
excerpt: TBD
highlight_image: /assets/polyglot.jpg
draft: false
nav_exclude: true
---
{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}
# o1 tops new aider polyglot leaderboard
{: .no_toc }
<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>
OpenAI's new o1 model with "high" reasoning effort
gets the top score on the
new
[aider polyglot leaderboard](/docs/leaderboard/), significantly ahead of
other top LLMs.
The new polyglot benchmark was designed to be
*much more challenging* than aider's old
[code editing benchmark](/docs/leaderboard/edit.html).
This more clearly distinguishes
the performance of
today's strongest coding models and
leaves headroom for future LLMs.
## The polyglot benchmark
Like aider's original code editing benchmark,
the new polyglot benchmark is based on Exercism
coding exercises.
The new polyglot benchmark:
- Contains coding problems in C++, Go, Java, JavaScript, Python and Rust.
The old benchmark was solely based on Python exercises.
- Focuses on the *most difficult* 225 exercises out of the 697 that
Exercism provides for those languages.
The old benchmark simply included all 133 Python exercises,
regardless of difficulty.
## Motivation and goals
Aider's original code editing benchmark was
saturating as the top scores approached and then surpassed 80%.
Sonnet's score of 84.2% was based on solving 112 of the 133
exercises, leaving only 21 unsolved exercises.
New champions were advancing the top score by
solving just 1-2 more problems than the previous record.
This made it hard to clearly
measure the
difference in code editing skill between these top models.
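To make the saturation concrete, here is a small sketch of the arithmetic behind those figures. It uses only the numbers quoted above (133 exercises, 112 solved); the function and variable names are illustrative and not part of aider.

```python
# Figures quoted in the text for the old, Python-only benchmark.
TOTAL_EXERCISES = 133
SOLVED_BY_SONNET = 112


def pass_rate(solved: int, total: int) -> float:
    """Percent of exercises solved, rounded to one decimal place."""
    return round(100 * solved / total, 1)


top_score = pass_rate(SOLVED_BY_SONNET, TOTAL_EXERCISES)
unsolved = TOTAL_EXERCISES - SOLVED_BY_SONNET
# Score gained by solving one additional exercise.
step = 100 / TOTAL_EXERCISES

print(f"top score: {top_score}%")
print(f"unsolved exercises: {unsolved}")
print(f"one more solved exercise adds {step:.2f} points")
```

With each exercise worth about three quarters of a point, a new record set by solving 1-2 more problems moves the top score by roughly 0.75 to 1.5 points.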
Part of the problem is that many of the original
133 Python problems are very easy
and provide
little challenge to today's frontier LLMs.
Models as old as GPT-3.5 Turbo were able to solve half of the
133 problems.
Such easy problems simply inflate the benchmark scores
of modern LLMs without
providing any data about which models are better or worse.
The main goal for a new benchmark
was to re-calibrate the scale so that
today's top coding LLMs
would occupy a wide range of scores between about 5% and 50%.
A 50% top score from today's best models
should leave lots of headroom for future LLMs.
And by spreading models across a wide 5-50% range, we
can more clearly compare relative performance.
## Designing the polyglot benchmark
The new benchmark:
- Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.
- Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today's top coding LLMs.
- Includes more total coding problems, to enable more granularity of comparison.
The new benchmark is based on Exercism coding problems
from 6 of the most popular programming languages:
- C++
- Go
- Java
- JavaScript
- Python
- Rust
Exercism provides a total of 697 coding problems in those 6 languages,
although many of them are adaptations of the same conceptual problem,
just ported into the different languages.
A set of 7 of today's top coding models each attempted all 697 of
the Exercism problems:
- Sonnet
- Haiku
- o1 Mini
- DeepSeek
- GPT-4o
- Qwen 32B Coder Instruct
- GPT-4o Mini
Based on their results,
the 697 coding problems were grouped by how many
of the 7 models solved each problem:
| Models that<br>solved it | Number of<br>problems | Cumulative number<br>of problems |
|--------|-----------|------------|
| 0 | 66 | 66 |
| 1 | 61 | 127 |
| 2 | 50 | 177 |
| 3 | 48 | 225 |
| 4 | 53 | 278 |
| 5 | 71 | 349 |
| 6 | 90 | 439 |
| 7 | 258 | 697 |
In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
These are far too easy, and wouldn't be good choices for the new benchmark.
Instead, we need the hard problems like the
66 that none of the 7 models were able to solve.
The new benchmark uses
the 225 problems that were solved by 3 or fewer models.
This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages:
| Language | Hard Set |
|-------------|----------|
| C++ | 26 |
| Go | 39 |
| Java | 47 |
| JavaScript | 49 |
| Python | 34 |
| Rust | 30 |
| **Total** | **225** |
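The selection rule described above (keep problems solved by 3 or fewer of the 7 models) can be sketched as follows. This is a minimal illustration, not the script used to build the benchmark: the counts are copied from the two tables above, while `select_hard`, `hard_set_size`, and the problem names in `example` are hypothetical.

```python
from itertools import accumulate

# Number of problems solved by exactly k of the 7 models (k = 0..7),
# copied from the first table above.
PROBLEMS_BY_SOLVER_COUNT = [66, 61, 50, 48, 53, 71, 90, 258]

# Hard-set size per language, copied from the second table above.
HARD_SET_BY_LANGUAGE = {
    "C++": 26, "Go": 39, "Java": 47,
    "JavaScript": 49, "Python": 34, "Rust": 30,
}


def hard_set_size(counts, max_solvers):
    """Count the problems solved by at most `max_solvers` models."""
    return sum(counts[: max_solvers + 1])


def select_hard(solvers_by_problem, max_solvers=3):
    """Return the problems solved by at most `max_solvers` models."""
    return sorted(
        problem
        for problem, solvers in solvers_by_problem.items()
        if len(solvers) <= max_solvers
    )


cumulative = list(accumulate(PROBLEMS_BY_SOLVER_COUNT))
print(cumulative)
print(hard_set_size(PROBLEMS_BY_SOLVER_COUNT, 3))
print(sum(HARD_SET_BY_LANGUAGE.values()))

# Hypothetical per-problem results, only to show the filter in action.
example = {
    "python/hypothetical-a": {"Sonnet", "Haiku", "GPT-4o", "DeepSeek"},
    "go/hypothetical-b": set(),
    "rust/hypothetical-c": {"Sonnet"},
}
print(select_hard(example))
```

Raising or lowering `max_solvers` trades difficulty against pool size: a cutoff of 2 would give 177 problems, while a cutoff of 4 would give 278.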
## o1
OpenAI's new o1 model established a very strong
top score of 62% on the new benchmark.
This still leaves 86 of the 225 problems unsolved,
as headroom for future models.
Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.
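As a quick consistency check, the headroom figure can be related back to the reported score. The sketch below assumes the 62% figure is a rounded value; the solved count is inferred from the 86 unsolved problems and is not stated directly in the text.

```python
# Figures quoted in the text for the polyglot benchmark.
HARD_SET_SIZE = 225
UNSOLVED_BY_O1 = 86

# Inferred, not quoted: problems o1 solved.
solved = HARD_SET_SIZE - UNSOLVED_BY_O1
score = 100 * solved / HARD_SET_SIZE

print(f"solved {solved} of {HARD_SET_SIZE} ({score:.1f}%)")
```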
## Results
<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
<thead style="background-color: #f2f2f2;">
<tr>
<th style="padding: 8px; text-align: left;">Model</th>
<th style="padding: 8px; text-align: center;">Percent completed correctly</th>
<th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
<th style="padding: 8px; text-align: left;">Command</th>
<th style="padding: 8px; text-align: center;">Edit format</th>
</tr>
</thead>
<tbody>
{% assign edit_sorted = site.data.polyglot_leaderboard | sort: 'pass_rate_2' | reverse %}
{% for row in edit_sorted %}
<tr style="border-bottom: 1px solid #ddd;">
<td style="padding: 8px;">{{ row.model }}</td>
<td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
<td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
<td style="padding: 8px;"><code>{{ row.command }}</code></td>
<td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
</tr>
{% endfor %}
</tbody>
</table>
<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
{% assign data_source = edit_sorted %}
{% assign pass_rate_field = "pass_rate_2" %}
{% include leaderboard.js %}
</script>
<style>
tr.selected {
color: #0056b3;
}
table {
table-layout: fixed;
}
td, th {
word-wrap: break-word;
overflow-wrap: break-word;
}
td:nth-child(3), td:nth-child(4) {
font-size: 12px;
}
</style>