Mirror of https://github.com/Aider-AI/aider.git, synced 2025-05-31 01:35:00 +00:00

commit 8b62d8a6c5
parent ec44850646

4 changed files with 380 additions and 0 deletions

203  aider/website/_posts/2024-12-21-polyglot.md  Normal file

@@ -0,0 +1,203 @@
---
excerpt: TBD
highlight_image: /assets/polyglot.jpg
draft: false
nav_exclude: true
---
{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}

# o1 tops new aider polyglot leaderboard
{: .no_toc }

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>

OpenAI's new o1 model with "high" reasoning effort
gets the top score on the
new
[aider polyglot leaderboard](/docs/leaderboard/), significantly ahead of
other top LLMs.
The new polyglot benchmark was designed to be
*much more challenging* than aider's old
[code editing benchmark](/docs/leaderboard/edit.html).
This more clearly distinguishes
the performance of
today's strongest coding models and
leaves headroom for future LLMs.

## The polyglot benchmark

Like aider's original code editing benchmark,
the new polyglot benchmark is based on Exercism
coding exercises.

The new polyglot benchmark:

- Contains coding problems in C++, Go, Java, JavaScript, Python and Rust.
The old benchmark was solely based on Python exercises.
- Focuses on the *most difficult* 225 exercises out of the 697 that
Exercism provides for those languages.
The old benchmark simply included all 133 Python exercises,
regardless of difficulty.

## Motivation and goals

Aider's original code editing benchmark was
saturating as the top scores approached and then surpassed 80%.
Sonnet's score of 84.2% was based on solving 112 of the 133
exercises, leaving only 21 unsolved exercises.
New champions were advancing the top score by
solving just 1-2 more problems than the previous record.
This made it hard to clearly
measure the
difference in code editing skill between these top models.

Part of the problem is that many of the original
133 Python problems are very easy
and provide
little challenge to today's frontier LLMs.
Models as old as GPT 3.5 Turbo were able to solve half of the
133 problems.
Such easy problems simply inflate the benchmark scores
of modern LLMs without
providing any data about which models are better or worse.

The main goal for a new benchmark
was to re-calibrate the scale so that
today's top coding LLMs
would occupy a wide range of scores between about 5% and 50%.
A 50% top score from today's best models
should leave lots of headroom for future LLMs.
And by spreading models across a wide 5-50% range, we
can more clearly compare relative performance.

## Designing the polyglot benchmark

The new benchmark:

- Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.
- Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today's top coding LLMs.
- Includes more total coding problems, to enable more granularity of comparison.

The new benchmark is based on Exercism coding problems
from 6 of the most popular programming languages:

- C++
- Go
- Java
- JavaScript
- Python
- Rust

Exercism provides a total of 697 coding problems in those 6 languages.
Many of them are adaptations of the same conceptual problem,
just ported into the different languages.

A set of 7 of today's top coding models each attempted all 697 of
the Exercism problems:

- Sonnet
- Haiku
- o1 Mini
- DeepSeek
- GPT-4o
- Qwen 32B Coder Instruct
- GPT-4o Mini

Based on their results,
the 697 coding problems were sorted by how many
of the models found a solution to each problem:

| Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
|--------|-----------|------------|
| 0 | 66 | 66 |
| 1 | 61 | 127 |
| 2 | 50 | 177 |
| 3 | 48 | 225 |
| 4 | 53 | 278 |
| 5 | 71 | 349 |
| 6 | 90 | 439 |
| 7 | 258 | 697 |

In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
These are far too easy, and wouldn't be good choices for the new benchmark.
Instead, we need the hard problems like the
66 that none of the 7 models were able to solve.

The new benchmark uses
the 225 problems that were solved by 3 or fewer models.
This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages,
as the table and the short selection sketch below show:

| Language | Hard Set |
|-------------|----------|
| C++ | 26 |
| Go | 39 |
| Java | 47 |
| JavaScript | 49 |
| Python | 34 |
| Rust | 30 |
| **Total** | **225** |
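
To make the selection rule concrete, here is a minimal sketch of the filtering
step, assuming a `solve_counts` mapping from problem id to the number of
reference models (0-7) that solved it. It is illustrative only, not the actual
benchmark tooling.

```python
# Minimal sketch of the hard-set selection: keep problems that 3 or fewer
# of the 7 reference models solved. The solve_counts data here is hypothetical.

MAX_SOLVERS = 3


def select_hard_set(solve_counts: dict[str, int]) -> list[str]:
    """Return the ids of problems solved by MAX_SOLVERS or fewer models."""
    return sorted(
        problem
        for problem, solvers in solve_counts.items()
        if solvers <= MAX_SOLVERS
    )


if __name__ == "__main__":
    demo_counts = {"cpp/exercise-a": 0, "go/exercise-b": 3, "python/exercise-c": 7}
    print(select_hard_set(demo_counts))  # -> ['cpp/exercise-a', 'go/exercise-b']
```

Applied to the real solve counts, this rule keeps the 66 + 61 + 50 + 48 = 225
problems from the cumulative table above.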

## o1

OpenAI's new o1 model established a very strong
top score of 62% on the new benchmark,
solving 139 of the 225 problems.
This still leaves 86 problems of headroom for future models
to solve.
Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.


## Results

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    {% assign edit_sorted = site.data.polyglot_leaderboard | sort: 'pass_rate_2' | reverse %}
    {% for row in edit_sorted %}
    <tr style="border-bottom: 1px solid #ddd;">
      <td style="padding: 8px;">{{ row.model }}</td>
      <td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
      <td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
      <td style="padding: 8px;"><code>{{ row.command }}</code></td>
      <td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>

<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
  {% assign data_source = edit_sorted %}
  {% assign pass_rate_field = "pass_rate_2" %}
  {% include leaderboard.js %}
</script>
<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>

14  aider/website/docs/leaderboards/contrib.md  Normal file

@@ -0,0 +1,14 @@
---
parent: Aider LLM Leaderboards
nav_order: 900
---

# Contributing results

Contributions of benchmark results are welcome!
See the
[benchmark README](https://github.com/Aider-AI/aider/blob/main/benchmark/README.md)
for information on running aider's code editing benchmarks.
Submit results by opening a PR with edits to the
[benchmark results data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).

134  aider/website/docs/leaderboards/edit.md  Normal file

@@ -0,0 +1,134 @@
---
parent: Aider LLM Leaderboards
highlight_image: /assets/leaderboard.jpg
nav_order: 50
description: Quantitative benchmark of basic LLM code editing skill.
---

# Code editing leaderboard


{: .note :}
This old
[aider code editing leaderboard](edit.html)
has been replaced by the
new, much more challenging
[polyglot leaderboard](/docs/leaderboard/).

[Aider's code editing benchmark](/docs/benchmarks.html#the-benchmark) asks the LLM to edit Python source files to complete 133 small coding exercises
from Exercism.
This measures the LLM's coding ability, and whether it can
write new code that integrates into existing code.
The model also has to successfully apply all its changes to the source file without human intervention.

<input type="text" id="editSearchInput" placeholder="Search..." style="width: 100%; max-width: 800px; margin: 10px auto; padding: 8px; display: block; border: 1px solid #ddd; border-radius: 4px;">

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    {% assign edit_sorted = site.data.edit_leaderboard | sort: 'pass_rate_2' | reverse %}
    {% for row in edit_sorted %}
    <tr style="border-bottom: 1px solid #ddd;">
      <td style="padding: 8px;">{{ row.model }}</td>
      <td style="padding: 8px; text-align: center;">{{ row.pass_rate_2 }}%</td>
      <td style="padding: 8px; text-align: center;">{{ row.percent_cases_well_formed }}%</td>
      <td style="padding: 8px;"><code>{{ row.command }}</code></td>
      <td style="padding: 8px; text-align: center;">{{ row.edit_format }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>
<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
  {% assign data_source = edit_sorted %}
  {% assign pass_rate_field = "pass_rate_2" %}
  {% include leaderboard.js %}
</script>
<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>


## Notes on benchmarking results

The key benchmarking results are:

- **Percent completed correctly** - Measures what percentage of the coding tasks the LLM completed successfully. To complete a task, the LLM must solve the programming assignment *and* edit the code to implement that solution.
- **Percent using correct edit format** - Measures the percent of coding tasks where the LLM complied with the edit format specified in the system prompt. If the LLM makes edit mistakes, aider will give it feedback and ask for a fixed copy of the edit. The best models can reliably conform to the edit format, without making errors. (See the sketch after this list.)
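
To make these two metrics concrete, here is a rough sketch of how they might
be computed from per-task results. It is not aider's actual benchmark code;
the `TaskResult` record and its field names are hypothetical and used only
for illustration.

```python
# Hypothetical sketch: compute the two headline percentages from a list
# of per-task benchmark results.
from dataclasses import dataclass


@dataclass
class TaskResult:
    passed: bool       # did the final code pass the exercise's tests?
    well_formed: bool  # did every LLM reply use the required edit format?


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (percent completed correctly, percent using correct edit format)."""
    total = len(results)
    pass_rate = 100 * sum(r.passed for r in results) / total
    well_formed_rate = 100 * sum(r.well_formed for r in results) / total
    return round(pass_rate, 1), round(well_formed_rate, 1)


if __name__ == "__main__":
    demo = [TaskResult(True, True), TaskResult(False, True), TaskResult(False, False)]
    print(summarize(demo))  # -> (33.3, 66.7)
```

In the leaderboard data used by the table above, these two percentages
correspond to the `pass_rate_2` and `percent_cases_well_formed` fields.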

## Notes on the edit format

Aider uses different "edit formats" to collect code edits from different LLMs.
The "whole" format is the easiest for an LLM to use, but it uses a lot of tokens
and may limit how large a file can be edited.
Models that can use one of the diff formats are much more efficient,
using far fewer tokens, and can edit larger files
at lower cost without hitting token limits.

Aider is configured to use the best edit format for the popular OpenAI and Anthropic models
and the [other models recommended on the LLM page](/docs/llms.html).
For lesser-known models, aider will default to using the "whole" editing format
since it is the easiest format for an LLM to use.

## Contributing benchmark results

Contributions of benchmark results are welcome!
See the
[benchmark README](https://github.com/Aider-AI/aider/blob/main/benchmark/README.md)
for information on running aider's code editing benchmarks.
Submit results by opening a PR with edits to the
[benchmark results data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).


<p class="post-date">
By Paul Gauthier,
last updated
<!--[[[cog
import subprocess
import datetime

files = [
    'aider/website/docs/leaderboards/index.md',
    'aider/website/_data/edit_leaderboard.yml',
    'aider/website/_data/refactor_leaderboard.yml'
]

def get_last_modified_date(file):
    # Ask git for the commit timestamp of the most recent change to this file.
    result = subprocess.run(['git', 'log', '-1', '--format=%ct', file], capture_output=True, text=True)
    if result.returncode == 0:
        timestamp = int(result.stdout.strip())
        return datetime.datetime.fromtimestamp(timestamp)
    return datetime.datetime.min

mod_dates = [get_last_modified_date(file) for file in files]
latest_mod_date = max(mod_dates)
cog.out(f"{latest_mod_date.strftime('%B %d, %Y.')}")
]]]-->
December 16, 2024.
<!--[[[end]]]-->
</p>

29  aider/website/docs/leaderboards/notes.md  Normal file

@@ -0,0 +1,29 @@
---
parent: Aider LLM Leaderboards
nav_order: 800
---

# Benchmark notes

## Notes on benchmarking results

The key benchmarking results are:

- **Percent completed correctly** - Measures what percentage of the coding tasks the LLM completed successfully. To complete a task, the LLM must solve the programming assignment *and* edit the code to implement that solution.
- **Percent using correct edit format** - Measures the percent of coding tasks where the LLM complied with the edit format specified in the system prompt. If the LLM makes edit mistakes, aider will give it feedback and ask for a fixed copy of the edit. The best models can reliably conform to the edit format, without making errors.


## Notes on the edit format

Aider uses different "edit formats" to collect code edits from different LLMs.
The "whole" format is the easiest for an LLM to use, but it uses a lot of tokens
and may limit how large a file can be edited.
Models that can use one of the diff formats are much more efficient,
using far fewer tokens, and can edit larger files
at lower cost without hitting token limits.
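
As a rough illustration of the difference (schematic only, not the exact
specification aider uses): with the "whole" format the LLM re-sends the entire
updated file, while a diff-style reply names the file and only the lines to
change, along these lines:

```
greeting.py
<<<<<<< SEARCH
def greet(name):
    print("hello")
=======
def greet(name):
    print(f"hello, {name}")
>>>>>>> REPLACE
```

Returning only such search/replace hunks uses far fewer output tokens than
re-sending a whole large file, which is why the diff formats scale better to
big files.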

Aider is configured to use the best edit format for the popular OpenAI and Anthropic models
and the [other models recommended on the LLM page](/docs/llms.html).
For lesser-known models, aider will default to using the "whole" editing format
since it is the easiest format for an LLM to use.