Paul Gauthier 2024-09-12 15:41:02 -07:00
parent c00ac80909
commit 72f52bdef0


@@ -9,6 +9,17 @@ nav_exclude: true
# Benchmark results for OpenAI o1-mini
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
{% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
{% include leaderboard_graph.html
chart_id="editChart"
data=edit_sorted
row_prefix="edit-row"
pass_rate_key="pass_rate_2"
%}
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
but scored below those models.
@@ -24,10 +35,10 @@ efficiently edit the source code, saving time and token costs.
The o1-mini model had trouble conforming to both the whole and diff edit formats.
Aider is extremely permissive and tries hard to accept anything close
to the correct formats.
It's possible that o1-mini would score better if aider prompted it with
more examples, or if aider were adapted to parse the ways o1-mini tends
to mangle the response formats.
Over time it may be possible to better harness o1-mini's capabilities through
different prompting and editing formats.
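As a rough illustration of the kind of permissiveness described above, here is a minimal sketch (not aider's actual implementation) of a lenient parser for SEARCH/REPLACE edit blocks that tolerates a variable number of marker characters, so slightly mangled markers like `<<<<< SEARCH` are still accepted:

```python
import re

# Hypothetical lenient markers: accept 5-9 repeats of the fence character
# instead of requiring exactly 7, since models often emit mangled fences.
HEAD = re.compile(r"^<{5,9} *SEARCH\s*$")
DIVIDER = re.compile(r"^={5,9}\s*$")
TAIL = re.compile(r"^>{5,9} *REPLACE\s*$")

def parse_edit_blocks(text):
    """Yield (search, replace) pairs, accepting slightly mangled markers."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if HEAD.match(lines[i]):
            search, replace = [], []
            i += 1
            while i < len(lines) and not DIVIDER.match(lines[i]):
                search.append(lines[i])
                i += 1
            i += 1  # skip the ======= divider
            while i < len(lines) and not TAIL.match(lines[i]):
                replace.append(lines[i])
                i += 1
            yield "\n".join(search), "\n".join(replace)
        i += 1
```

A real harness would also need to recover filenames, fuzzy-match the search text against the file, and handle blocks that are cut off mid-stream; this sketch only shows the marker-tolerance idea.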
@@ -49,6 +60,7 @@ aider --model o1-preview
> These are *preliminary* benchmark results, which will be updated as
> additional benchmark runs complete and rate limits open up.
<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
<thead style="background-color: #f2f2f2;">
<tr>
@@ -60,7 +72,6 @@ aider --model o1-preview
</tr>
</thead>
<tbody>
{% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
{% for row in edit_sorted %}
<tr style="border-bottom: 1px solid #ddd;">
<td style="padding: 8px;">{{ row.model }}</td>
@@ -73,14 +84,6 @@ aider --model o1-preview
</tbody>
</table>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
{% include leaderboard_graph.html
chart_id="editChart"
data=edit_sorted
row_prefix="edit-row"
pass_rate_key="pass_rate_2"
%}
<style>
tr.selected {