diff --git a/aider/website/_data/edit_leaderboard.yml b/aider/website/_data/edit_leaderboard.yml
index c567300d1..08e333889 100644
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@@ -2203,4 +2203,30 @@
date: 2024-12-18
versions: 0.69.2.dev
seconds_per_case: 29.9
- total_cost: 0.0000
\ No newline at end of file
+ total_cost: 0.0000
+
+- dirname: 2024-12-21-22-06-01--polyglot-o1-mini-whole
+ test_cases: 225
+ model: o1-mini-2024-09-12
+ edit_format: whole
+ commit_hash: a755079-dirty
+ pass_rate_1: 8.9
+ pass_rate_2: 27.1
+ pass_num_1: 20
+ pass_num_2: 61
+ percent_cases_well_formed: 95.6
+ error_outputs: 15
+ num_malformed_responses: 14
+ num_with_malformed_responses: 10
+ user_asks: 37
+ lazy_comments: 0
+ syntax_errors: 0
+ indentation_errors: 0
+ exhausted_context_windows: 0
+ test_timeouts: 5
+ total_tests: 225
+ command: aider --model o1-mini
+ date: 2024-12-21
+ versions: 0.69.2.dev
+ seconds_per_case: 34.3
+ total_cost: 17.6270
\ No newline at end of file
diff --git a/aider/website/_posts/2024-12-21-polyglot.md b/aider/website/_posts/2024-12-21-polyglot.md
index 639b8041d..8218631b9 100644
--- a/aider/website/_posts/2024-12-21-polyglot.md
+++ b/aider/website/_posts/2024-12-21-polyglot.md
@@ -1,5 +1,6 @@
---
-excerpt: TBD
+title: o1 tops new aider polyglot leaderboard
+excerpt: o1 scores the top result on aider's new multi-language, more challenging coding benchmark.
highlight_image: /assets/polyglot.jpg
draft: false
nav_exclude: true
@@ -67,10 +68,9 @@ The main goal for a new benchmark
was to re-calibrate the scale so that
today's top coding LLMs
would occupy a wide range of scores between about 5% and 50%.
-A 50% top score from today's best models
-should leave lots of headroom for future LLMs.
-And by spreading models across a wide 5-50% range, we
-can more clearly compare relative performance.
+This should leave headroom for future LLMs and
+make it possible to
+more clearly compare the relative performance of top models.
## Designing the polyglot benchmark
@@ -91,9 +91,6 @@ from 6 of the most popular programming languages:
- Rust
Exercism provides a total of 697 coding problems in those 6 languages.
-Although many of them are adaptations of the same conceptual problem,
-just ported into the different languages.
-
A set of 7 of today's top coding models each attempted all 697 of
the Exercism problems:
@@ -105,9 +102,9 @@ the Exercism problems:
- Qwen 32B Coder Instruct
- GPT-4o Mini
-Based on their results,
-the 697 coding problems were sorted by how many
-solutions were found to each problem:
+Depending on its difficulty,
+each problem was solved by a different number of the
+7 models:
| Solutions found | Number of problems | Cumulative number of problems |
|--------|-----------|------------|
@@ -122,8 +119,8 @@ solutions were found to each problem:
In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
-These are far too easy, and wouldn't be good choices for the new benchmark.
-Instead, we need the hard problems like the
+These problems are far too easy, and wouldn't be good choices for the new benchmark.
+Instead, we need hard problems like the
66 that none of the 7 models were able to solve.
The new benchmark uses
@@ -132,7 +129,7 @@ This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages:
-| Language | Hard Set |
+| Language | Problems |
|-------------|----------|
| C++ | 26 |
| Go | 39 |
@@ -152,6 +149,13 @@ Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.
+## Benchmark problems
+
+The 225 coding problems are available in the
+[aider polyglot benchmark repo]()
+on GitHub.
+
+
## Results