From b74edcf3507da166274066a9605fedd60950868d Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Sat, 4 May 2024 11:19:32 -0700
Subject: [PATCH] copy

---
 docs/leaderboard.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/leaderboard.md b/docs/leaderboard.md
index 1dd6e1ce0..4d05151dc 100644
--- a/docs/leaderboard.md
+++ b/docs/leaderboard.md
@@ -9,10 +9,10 @@ so the LLM needs to be capable of reliably specifying how to edit code.
 
 Aider uses two benchmarks to measure an LLM's code editing ability:
 
-- The [code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark) asks the LLM to edit python source files to complete 133 Exercism exercises.
-- The [refactoring benchmark](https://github.com/paul-gauthier/refactor-benchmark) asks the LLM to refactor large methods from a large python source file. This is a more challenging benchmark, which tests the model's ability to output long chunks of code without skipping sections.
+- The [code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark) asks the LLM to edit python source files to complete 133 Exercism exercises. This benchmark measures the LLM's ability to emit code edits according to the format aider specifies in the system prompt.
+- The [refactoring benchmark](https://github.com/paul-gauthier/refactor-benchmark) asks the LLM to refactor 89 large methods from large python classes. This is a more challenging benchmark, which tests the model's ability to output long chunks of code without skipping sections. It was developed to provoke and measure GPT-4 Turbo's "lazy coding" habit.
 
-These leaderboards report the results from a number of popular LLMs,
+The leaderboards below report the results from a number of popular LLMs,
 to help users select which models to use with aider.
 While [aider can connect to almost any LLM](https://aider.chat/docs/llms.html)
 it will work best with models that score well on the benchmarks.
@@ -83,6 +83,10 @@ it will work best with models that score well on the benchmarks.
 
 ## Code refactoring leaderboard
 
+The refactoring benchmark requires a large context window to
+work with large source files.
+Therefore, results are available for fewer models.
+