mirror of
https://github.com/Aider-AI/aider.git
synced 2025-06-13 08:05:01 +00:00
121 lines
3.7 KiB
Markdown
121 lines
3.7 KiB
Markdown
---
|
|
title: o1-preview is SOTA on the aider leaderboard
|
|
excerpt: Preliminary benchmark results for the new OpenAI o1 models.
|
|
nav_exclude: true
|
|
---
|
|
{% if page.date %}
|
|
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
|
{% endif %}
|
|
|
|
# OpenAI o1-preview is SOTA on the aider leaderboard
|
|
|
|
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
|
|
|
|
{% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
|
|
{% include leaderboard_graph.html
|
|
chart_id="editChart"
|
|
data=edit_sorted
|
|
row_prefix="edit-row"
|
|
pass_rate_key="pass_rate_2"
|
|
%}
|
|
|
|
|
|
## o1-preview
|
|
|
|
OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
|
|
a state of the art result.
|
|
It achieved this result with the
|
|
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
|
where the LLM returns a full copy of the source code file with changes.
|
|
|
|
It is much more practical to use aider's
|
|
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
|
|
which allows the LLM to return search/replace blocks to
|
|
efficiently edit the source code.
|
|
This saves significant time and token costs.
|
|
|
|
Using the diff edit format the o1-preview model had a strong
|
|
benchmark score of 75.2%.
|
|
This likely places o1-preview between Sonnet and GPT-4o for practical use,
|
|
but at significantly higher cost.
|
|
|
|
## o1-mini
|
|
|
|
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
|
|
but scored below those models.
|
|
It also works best with the whole edit format.
|
|
|
|
|
|
## Future work
|
|
|
|
The o1-preview model had trouble conforming to aider's diff edit format.
|
|
The o1-mini model had trouble conforming to both the whole and diff edit formats.
|
|
Aider is extremely permissive and tries hard to accept anything close
|
|
to the correct formats.
|
|
|
|
It is surprising that such strong models had trouble with
|
|
the syntactic requirements of simple text output formats.
|
|
It seems likely that aider could optimize its prompts and edit formats to
|
|
better harness the o1 models.
|
|
|
|
|
|
## Using aider with o1
|
|
|
|
OpenAI's new o1 models are supported in the development version of aider:
|
|
|
|
```
|
|
# To upgrade to the development version:
|
|
aider --install-main-branch
|
|
|
|
# Or, to upgrade/install:
|
|
python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git
|
|
|
|
# To launch aider with an o1 model:
|
|
aider --model o1-mini
|
|
aider --model o1-preview
|
|
```
|
|
|
|
{: .note }
|
|
> These are *preliminiary* benchmark results, which will be updated as
|
|
> additional benchmark runs complete and rate limits open up.
|
|
|
|
|
|
<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
|
|
<thead style="background-color: #f2f2f2;">
|
|
<tr>
|
|
<th style="padding: 8px; text-align: left;">Model</th>
|
|
<th style="padding: 8px; text-align: center;">Percent completed correctly</th>
|
|
<th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
|
|
<th style="padding: 8px; text-align: left;">Command</th>
|
|
<th style="padding: 8px; text-align: center;">Edit format</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody>
|
|
{% for row in edit_sorted %}
|
|
<tr style="border-bottom: 1px solid #ddd;">
|
|
<td style="padding: 8px;">{{ row.model }}</td>
|
|
<td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
|
|
<td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
|
|
<td style="padding: 8px;"><code>{{ row.command }}</code></td>
|
|
<td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
|
|
</tr>
|
|
{% endfor %}
|
|
</tbody>
|
|
</table>
|
|
|
|
|
|
<style>
|
|
tr.selected {
|
|
color: #0056b3;
|
|
}
|
|
table {
|
|
table-layout: fixed;
|
|
}
|
|
td, th {
|
|
word-wrap: break-word;
|
|
overflow-wrap: break-word;
|
|
}
|
|
td:nth-child(3), td:nth-child(4) {
|
|
font-size: 12px;
|
|
}
|
|
</style>
|