---
title: Claude 3 beats GPT-4 on Aider code editing benchmark
excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
highlight_image: /assets/2024-03-07-claude-3.svg
---
# Claude 3 beats GPT-4 on Aider code editing benchmark
Anthropic just released their new Claude 3 models with evals showing better performance on coding tasks. With that in mind, I've been benchmarking the new models using Aider's code editing benchmark suite. Claude 3 Opus outperforms all of OpenAI's models, making it the best available model for pair programming with AI.
Aider currently supports Claude 3 Opus via OpenRouter:
```bash
# Install aider
pip install aider-chat

# Setup OpenRouter access
export OPENAI_API_KEY=<your-openrouter-key>
export OPENAI_API_BASE=https://openrouter.ai/api/v1

# Run aider with Claude 3 Opus using the diff editing format
aider --model anthropic/claude-3-opus --edit-format diff
```
## Aider's code editing benchmark
Aider is an open source command line chat tool that lets you pair program with AI on code in your local git repo.
Aider relies on a code editing benchmark to quantitatively evaluate how well an LLM can make changes to existing code. The benchmark uses aider to try to complete 133 Exercism Python coding exercises. For each exercise, Exercism provides a starting Python file with stubs for the needed functions, a natural language description of the problem to solve, and a test suite to evaluate whether the coder has correctly solved the problem.
The LLM gets two tries to solve each problem (sketched in code after this list):
- On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
- If any tests fail, aider sends the LLM the failing test output and gives it a second try to complete the task.
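To make that flow concrete, here is a minimal sketch of the two-try loop in Python. The `edit_with_llm` helper is a hypothetical stand-in for aider's chat-and-edit step, not its actual API; only the pytest invocation is meant literally.

```python
import subprocess


def run_tests(exercise_dir: str) -> tuple[bool, str]:
    """Run the exercise's test suite with pytest and capture the output."""
    result = subprocess.run(
        ["pytest", exercise_dir], capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr


def run_exercise(exercise_dir: str, instructions: str) -> bool:
    """Give the LLM up to two tries to solve one Exercism exercise."""
    # First try: the LLM sees only the stub code and the task description.
    # edit_with_llm is a hypothetical helper, not aider's real interface.
    edit_with_llm(exercise_dir, instructions)
    passed, test_output = run_tests(exercise_dir)
    if passed:
        return True

    # Second try: the failing test output is sent back to the LLM.
    edit_with_llm(exercise_dir, instructions, test_output=test_output)
    passed, _ = run_tests(exercise_dir)
    return passed
```

A model's headline score is then the fraction of the 133 exercises it solves within those two tries.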
## Benchmark results
### Claude 3 Opus
- The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
- While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
### Claude 3 Sonnet
- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models, with an overall score of 54.9% and a first-try score of 43.6%.
## Other observations
There are a few other things worth noting:
- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
- Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the beer song program, which at least makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider does exponential backoff retries in these cases (see the sketch below), but it's a sign that they may be struggling under surging demand.
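For reference, that retry behavior can be approximated with a few lines of Python. This is a generic illustration of exponential backoff with jitter, not aider's actual retry code; the `ServerError` type and the retry limit are assumptions for the sketch.

```python
import random
import time


class ServerError(Exception):
    """Stand-in for an HTTP 5xx response from the model API (illustrative)."""


def with_retries(send_request, max_retries: int = 5):
    """Call send_request, retrying with exponential backoff on 5xx errors."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except ServerError:
            if attempt == max_retries - 1:
                raise  # Out of retries; surface the error to the caller.
            # Wait 1s, 2s, 4s, ... plus a little jitter before retrying.
            time.sleep(2 ** attempt + random.random())
```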