2025-05-29 08:44:59 +00:00 · 2024-12-22 08:43:59 -05:00 · 2024-12-22 08:43:59 -05:00 · c895e99306
commit c895e99306
parent 6d7e8beaaa
2 changed files with 45 additions and 15 deletions
--- a/aider/website/_data/edit_leaderboard.yml
+++ b/aider/website/_data/edit_leaderboard.yml
@@ -2203,4 +2203,30 @@
  date: 2024-12-18
  versions: 0.69.2.dev
  seconds_per_case: 29.9
-  total_cost: 0.0000
+  total_cost: 0.0000
 - dirname: 2024-12-21-22-06-01--polyglot-o1-mini-whole
  test_cases: 225
  model: o1-mini-2024-09-12
  edit_format: whole
  commit_hash: a755079-dirty
  pass_rate_1: 8.9
  pass_rate_2: 27.1
  pass_num_1: 20
  pass_num_2: 61
  percent_cases_well_formed: 95.6
  error_outputs: 15
  num_malformed_responses: 14
  num_with_malformed_responses: 10
  user_asks: 37
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 5
  total_tests: 225
  command: aider --model o1-mini
  date: 2024-12-21
  versions: 0.69.2.dev
  seconds_per_case: 34.3
  total_cost: 17.6270
--- a/aider/website/_posts/2024-12-21-polyglot.md
+++ b/aider/website/_posts/2024-12-21-polyglot.md
@@ -1,5 +1,6 @@
 ---
-excerpt: TBD
+title: o1 tops new aider polyglot leaderboard
+excerpt: o1 scores the top result on aider's new multi-language, more challenging coding benchmark.
 highlight_image: /assets/polyglot.jpg
 draft: false
 nav_exclude: true
@@ -67,10 +68,9 @@ The main goal for a new benchmark
 was to re-calibrate the scale so that
 today's top coding LLMs 
 would occupy a wide range of scores between about 5% and 50%.
-A 50% top score from today's best models
+This should leave headroom for future LLMs and
-should leave lots of headroom for future LLMs.
+make it possible to
-And by spreading models across a wide 5-50% range, we
+more clearly compare the relative performance of top models.
 can more clearly compare relative performance.
 ## Designing the polyglot benchmark
@@ -91,9 +91,6 @@ from 6 of the most popular programming languages:
 - Rust
 Exercism provides a total of 697 coding problems in those 6 languages.
 Although many of them are adaptations of the same conceptual problem,
 just ported into the different languages.
 A set of 7 of today's top coding models each attempted all 697 of
 the Exercism problems:
@@ -105,9 +102,9 @@ the Exercism problems:
 - Qwen 32B Coder Instruct
 - GPT-4o Mini
-Based on their results, 
+Depending on the difficulty of the problems,
-the 697 coding problems were sorted by how many 
+a different number of solutions were found by the collection of
-solutions were found to each problem:
+7 models:
 | Solutions<br>found | Number of<br>problems | Cumulative number<br>of problems |
 |--------|-----------|------------|
@@ -122,8 +119,8 @@ solutions were found to each problem:
 In the table above, you can see that 258 of the problems were solved
 by all 7 LLMs.
-These are far too easy, and wouldn't be good choices for the new benchmark.
+These problems are far too easy, and wouldn't be good choices for the new benchmark.
-Instead, we need the hard problems like the
+Instead, we need hard problems like the
 66 that none of the 7 models were able to solve.
 The new benchmark uses 
@@ -132,7 +129,7 @@ This achieves a balance between hard and moderate problems,
 and provides a large but not excessive total pool of problems.
 It also represents a good diversity of coding languages:
-| Language    | Hard Set |
+| Language    | Problems |
 |-------------|----------|
 | C++         | 26       |
 | Go          | 39       |
@@ -152,6 +149,13 @@ Given the incredible pace of recent advancements, it
 will be interesting to see
 how long it will take for this new benchmark to saturate.
 ## Benchmark problems
 The 225 coding problems are available in the
 [aider polyglot benchmark repo]()
 on GitHub.
 ## Results