mirror of https://github.com/Aider-AI/aider.git

commit 8b62d8a6c5 (parent ec44850646)
4 changed files with 380 additions and 0 deletions

aider/website/_posts/2024-12-21-polyglot.md (new file, 203 lines)

@@ -0,0 +1,203 @@
---
excerpt: TBD
highlight_image: /assets/polyglot.jpg
draft: false
nav_exclude: true
---

{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}

# o1 tops new aider polyglot leaderboard
{: .no_toc }

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>

OpenAI's new o1 model with "high" reasoning effort
gets the top score on the new
[aider polyglot leaderboard](/docs/leaderboard/),
significantly ahead of other top LLMs.
The new polyglot benchmark was designed to be
*much more challenging* than aider's old
[code editing benchmark](/docs/leaderboard/edit.html).
This more clearly distinguishes the performance of
today's strongest coding models and
leaves headroom for future LLMs.

## The polyglot benchmark

Like aider's original code editing benchmark,
the new polyglot benchmark is based on Exercism
coding exercises.

The new polyglot benchmark:

- Contains coding problems in C++, Go, Java, JavaScript, Python and Rust.
The old benchmark was solely based on Python exercises.
- Focuses on the *most difficult* 225 exercises out of the 697 that
Exercism provides for those languages.
The old benchmark simply included all 133 Python exercises,
regardless of difficulty.

## Motivation and goals

Aider's original code editing benchmark was
saturating as the top scores approached and then surpassed 80%.
Sonnet's score of 84.2% was based on solving 112 of the 133
exercises, leaving only 21 unsolved exercises.
New champions were advancing the top score by
solving just 1-2 more problems than the previous record.
This made it hard to clearly measure the
difference in code editing skill between these top models.
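
To put that in perspective,
each additional solve on the old 133-problem benchmark was worth
only 1/133, or about 0.75 percentage points.
So a new champion that solved 1-2 more problems than the previous record
led by at most about 1.5 points.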

Part of the problem is that many of the original
133 Python problems are very easy
and provide little challenge to today's frontier LLMs.
Models as old as GPT-3.5 Turbo were able to solve half of the
133 problems.
Such easy problems simply inflate the benchmark scores
of modern LLMs without
providing any data about which models are better or worse.

The main goal for a new benchmark
was to re-calibrate the scale so that
today's top coding LLMs
would occupy a wide range of scores between about 5% and 50%.
A 50% top score from today's best models
should leave lots of headroom for future LLMs.
And by spreading models across a wide 5-50% range, we
can more clearly compare relative performance.

## Designing the polyglot benchmark

The new benchmark:

- Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.
- Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today's top coding LLMs.
- Includes more total coding problems, to enable more granularity of comparison.

The new benchmark is based on Exercism coding problems
from 6 of the most popular programming languages:

- C++
- Go
- Java
- JavaScript
- Python
- Rust

Exercism provides a total of 697 coding problems in those 6 languages,
although many of them are adaptations of the same conceptual problem,
just ported into the different languages.

A set of 7 of today's top coding models each attempted all 697 of
the Exercism problems:

- Sonnet
- Haiku
- o1 Mini
- DeepSeek
- GPT-4o
- Qwen 32B Coder Instruct
- GPT-4o Mini

Based on their results,
the 697 coding problems were sorted by how many
of the 7 models were able to solve each problem:
| Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
|--------------------|-----------------------|----------------------------------|
| 0                  | 66                    | 66                               |
| 1                  | 61                    | 127                              |
| 2                  | 50                    | 177                              |
| 3                  | 48                    | 225                              |
| 4                  | 53                    | 278                              |
| 5                  | 71                    | 349                              |
| 6                  | 90                    | 439                              |
| 7                  | 258                   | 697                              |

In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
These are far too easy, and wouldn't be good choices for the new benchmark.
Instead, we need the hard problems like the
66 that none of the 7 models were able to solve.

The new benchmark uses
the 225 problems that were solved by 3 or fewer models,
as sketched in the code below.
This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages:

| Language   | Hard Set |
|------------|----------|
| C++        | 26       |
| Go         | 39       |
| Java       | 47       |
| JavaScript | 49       |
| Python     | 34       |
| Rust       | 30       |
| **Total**  | **225**  |
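
Concretely, the hard-set selection just counts how many of the 7 models
solved each problem and keeps the problems solved by 3 or fewer.
Here is a minimal sketch of that filtering in Python, with hypothetical
inputs (the `all_problems` and `solved_by` structures and the
`language/exercise` ID scheme are illustrative, not aider's actual
benchmark output format):

```python
from collections import Counter

# Hypothetical inputs:
# - all_problems: IDs of all 697 Exercism problems, as "language/exercise"
# - solved_by: for each of the 7 models, the set of problem IDs it solved
all_problems = {"cpp/pov", "go/ledger", "python/zebra-puzzle"}  # ... 697 total
solved_by = {
    "sonnet": {"cpp/pov", "go/ledger"},
    "haiku": {"go/ledger"},
    # ... one entry for each of the 7 models
}

# Count how many of the models solved each problem.
solve_counts = Counter()
for solved in solved_by.values():
    solve_counts.update(solved)

# Distribution table: how many problems were solved by exactly N models.
distribution = Counter(solve_counts[p] for p in all_problems)

# The hard set: problems solved by 3 or fewer of the models
# (unsolved problems count as 0 and are included).
hard_set = sorted(p for p in all_problems if solve_counts[p] <= 3)

# Per-language tally of the hard set, using the "language/" ID prefix.
by_language = Counter(p.split("/")[0] for p in hard_set)
print(len(hard_set), dict(by_language))
```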

## o1

OpenAI's new o1 model established a very strong
top score of 62% on the new benchmark,
solving 139 of the 225 problems.
This still leaves 86 problems of headroom for future models
to solve.
Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.

## Results

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    {% assign edit_sorted = site.data.polyglot_leaderboard | sort: 'pass_rate_2' | reverse %}
    {% for row in edit_sorted %}
    <tr style="border-bottom: 1px solid #ddd;">
      <td style="padding: 8px;">{{ row.model }}</td>
      <td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
      <td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
      <td style="padding: 8px;"><code>{{ row.command }}</code></td>
      <td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>

<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
{% assign data_source = edit_sorted %}
{% assign pass_rate_field = "pass_rate_2" %}
{% include leaderboard.js %}
</script>
<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>