diff --git a/aider/website/_data/polyglot_leaderboard.yml b/aider/website/_data/polyglot_leaderboard.yml
index c2024d1dc..9badd7a85 100644
--- a/aider/website/_data/polyglot_leaderboard.yml
+++ b/aider/website/_data/polyglot_leaderboard.yml
@@ -78,7 +78,7 @@
 - dirname: 2024-12-21-19-23-03--polyglot-o1-hard-diff
   test_cases: 224
-  model: o1-2024-12-17
+  model: o1-2024-12-17 (high)
   edit_format: diff
   commit_hash: a755079-dirty
   pass_rate_1: 23.7
diff --git a/benchmark/README.md b/benchmark/README.md
index 6b20c3797..b9e1b1e43 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -2,18 +2,18 @@
 # Aider benchmark harness
 
 Aider uses benchmarks to quantitatively measure how well it works
-various LLMs.
+with various LLMs.
 
 This directory holds the harness and tools needed to run the benchmarking suite.
 
 ## Background
 
 The benchmark is based on the [Exercism](https://github.com/exercism/python)
 coding exercises. This
-benchmark evaluates how effectively aider and GPT can translate a
+benchmark evaluates how effectively aider and LLMs can translate a
 natural language coding request into executable code saved into
 files that pass unit tests. It provides an end-to-end evaluation of not just
-GPT's coding ability, but also its capacity to *edit existing code*
+the LLM's coding ability, but also its capacity to *edit existing code*
 and *format those code edits* so that aider can save the edits to the
 local source files.
 