o1-mini blog article

2025-06-10 22:55:00 +00:00 · 2024-09-12 14:07:06 -07:00 · 2024-09-12 14:07:06 -07:00 · 96587f5f46
commit 96587f5f46
parent 291b456a45
3 changed files with 181 additions and 1 deletions
--- a/aider/website/_posts/2024-09-12-o1.md
+++ b/aider/website/_posts/2024-09-12-o1.md
@ -0,0 +1,82 @@
+---
+title: Benchmark results for OpenAI o1-mini
+excerpt: Preliminary benchmark results for the new OpenAI o1-mini model.
+nav_exclude: true
+---
+{% if page.date %}
+<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
+{% endif %}
+
+# Benchmark results for OpenAI o1-mini
+
+OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet. 
+o1-mini scored below those models
+when using the simple "whole" editing format.
+It was close enough to GPT-4o to be within the margin of error.
+
+The o1-mini model had trouble following the very simple whole editing format.
+It's possible that it would get a better score if aider prompted with
+more examples or was adapted to parse o1-mini's favorite way to mangle
+the response format.
+
+Note that o1-mini's "whole" score is compared against GPT-4o and Sonnet 
+"diff" results.
+Using diff is more challenging for GPT-4o and Sonnet,
+but it allows them to return search/replace blocks to 
+efficiently edit the source code.
+The whole format requires the o1-mini to return a fresh copy of the entire file,
+increasing costs and latency.
+
+
+{: .note }
+> These are *preliminiary* benchmark results, which will be updated as
+> additional benchmark runs complete and rate limits open up.
+
+<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
+  <thead style="background-color: #f2f2f2;">
+    <tr>
+      <th style="padding: 8px; text-align: left;">Model</th>
+      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
+      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
+      <th style="padding: 8px; text-align: left;">Command</th>
+      <th style="padding: 8px; text-align: center;">Edit format</th>
+    </tr>
+  </thead>
+  <tbody>
+    {% assign edit_sorted = site.data.o1_results | sort: 'pass_rate_2' | reverse %}
+    {% for row in edit_sorted %}
+      <tr style="border-bottom: 1px solid #ddd;">
+        <td style="padding: 8px;">{{ row.model }}</td>
+        <td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
+        <td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
+        <td style="padding: 8px;"><code>{{ row.command }}</code></td>
+        <td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
+      </tr>
+    {% endfor %}
+  </tbody>
+</table>
+
+<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
+
+{% include leaderboard_graph.html
+  chart_id="editChart" 
+  data=edit_sorted 
+  row_prefix="edit-row" 
+  pass_rate_key="pass_rate_2"
+%}
+
+<style>
+  tr.selected {
+    color: #0056b3;
+  }
+  table {
+    table-layout: fixed;
+  }
+  td, th {
+    word-wrap: break-word;
+    overflow-wrap: break-word;
+  }
+  td:nth-child(3), td:nth-child(4) {
+    font-size: 12px;
+  }
+</style>