---
title: o1-preview is SOTA on the aider leaderboard
excerpt: Preliminary benchmark results for the new OpenAI o1 models.
nav_exclude: true
---
{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}

# OpenAI o1-preview is SOTA on the aider leaderboard

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
{% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
{% include leaderboard_graph.html
chart_id="editChart"
data=edit_sorted
row_prefix="edit-row"
pass_rate_key="pass_rate_2"
%}

## o1-preview

OpenAI o1-preview scored 79.7% on aider's code editing benchmark,
a state-of-the-art result.
It achieved this result with the
["whole" edit format](/docs/leaderboards/#notes-on-the-edit-format),
where the LLM returns a full copy of the source code file with changes.

It is much more practical to use aider's
["diff" edit format](/docs/leaderboards/#notes-on-the-edit-format),
which allows the LLM to return search/replace blocks to
efficiently edit the source code.
This saves significant time and token costs.

Using the diff edit format, the o1-preview model had a strong
benchmark score of 75.2%.
This likely places o1-preview between Sonnet and GPT-4o for practical use,
but at significantly higher cost.
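
For readers who haven't used these formats: a "whole" reply re-sends the
complete, updated file, while a "diff" reply contains one or more
search/replace blocks.
The block below is only an illustrative sketch of the general shape of such
an edit (the file name and code are made up); see the edit format docs
linked above for the exact conventions aider expects.

```
greeting.py
<<<<<<< SEARCH
def greet():
    print("hello")
=======
def greet():
    print("hello, world")
>>>>>>> REPLACE
```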
## o1-mini
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
but scored below those models.
It also works best with the whole edit format.

## Future work

The o1-preview model had trouble conforming to aider's diff edit format.
The o1-mini model had trouble conforming to both the whole and diff edit formats.
Aider is extremely permissive and tries hard to accept anything close
to the correct formats.

It is surprising that such strong models had trouble with
the syntactic requirements of simple text output formats.
It seems likely that aider could optimize its prompts and edit formats to
better harness the o1 models.

## Using aider with o1

OpenAI's new o1 models are supported in the development version of aider:

```
# Upgrade an existing install to the development version:
aider --install-main-branch

# Or, install/upgrade the development version directly from GitHub:
python -m pip install --upgrade git+https://github.com/paul-gauthier/aider.git

# Launch aider with an o1 model:
aider --model o1-mini
aider --model o1-preview
```
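
Aider reaches the o1 models through the OpenAI API, so you'll need an API
key with access to them.
A minimal sketch, assuming the key is supplied via the standard
`OPENAI_API_KEY` environment variable (the key value below is a placeholder):

```
# Make your OpenAI API key available to aider, then start a chat with o1-preview:
export OPENAI_API_KEY=your-openai-api-key
aider --model o1-preview
```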

{: .note }
> These are *preliminary* benchmark results, which will be updated as
> additional benchmark runs complete and rate limits open up.

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
<thead style="background-color: #f2f2f2;">
<tr>
<th style="padding: 8px; text-align: left;">Model</th>
<th style="padding: 8px; text-align: center;">Percent completed correctly</th>
<th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
<th style="padding: 8px; text-align: left;">Command</th>
<th style="padding: 8px; text-align: center;">Edit format</th>
</tr>
</thead>
<tbody>
{% for row in edit_sorted %}
<tr style="border-bottom: 1px solid #ddd;">
<td style="padding: 8px;">{{ row.model }}</td>
<td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
<td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
<td style="padding: 8px;"><code>{{ row.command }}</code></td>
<td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
</tr>
{% endfor %}
</tbody>
</table>
<style>
tr.selected {
color: #0056b3;
}
table {
table-layout: fixed;
}
td, th {
word-wrap: break-word;
overflow-wrap: break-word;
}
td:nth-child(3), td:nth-child(4) {
font-size: 12px;
}
</style>