From 87a964355b13b6fde6eb9ff2a75fcd8d17b62ae4 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Mon, 23 Dec 2024 08:00:25 -0500
Subject: [PATCH] copy

---
 aider/website/_data/polyglot_leaderboard.yml | 2 +-
 benchmark/README.md                          | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/aider/website/_data/polyglot_leaderboard.yml b/aider/website/_data/polyglot_leaderboard.yml
index c2024d1dc..9badd7a85 100644
--- a/aider/website/_data/polyglot_leaderboard.yml
+++ b/aider/website/_data/polyglot_leaderboard.yml
@@ -78,7 +78,7 @@
 - dirname: 2024-12-21-19-23-03--polyglot-o1-hard-diff
   test_cases: 224
-  model: o1-2024-12-17
+  model: o1-2024-12-17 (high)
   edit_format: diff
   commit_hash: a755079-dirty
   pass_rate_1: 23.7
diff --git a/benchmark/README.md b/benchmark/README.md
index 6b20c3797..b9e1b1e43 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -2,18 +2,18 @@
 # Aider benchmark harness

 Aider uses benchmarks to quantitatively measure how well it works
-various LLMs.
+with various LLMs.

 This directory holds the harness and tools needed to run the benchmarking
 suite.

 ## Background

 The benchmark is based on the
 [Exercism](https://github.com/exercism/python) coding exercises. This
-benchmark evaluates how effectively aider and GPT can translate a
+benchmark evaluates how effectively aider and LLMs can translate a
 natural language coding request into executable code saved into
 files that pass unit tests. It provides an end-to-end evaluation of not just
-GPT's coding ability, but also its capacity to *edit existing code*
+the LLM's coding ability, but also its capacity to *edit existing code*
 and *format those code edits* so that aider can save the edits to the
 local source files.