---
title: Claude 3 Opus beats GPT-4 on Aider code editing benchmark
excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
highlight_image: /assets/2024-03-07-claude-3.svg
---
# Claude 3 Opus beats GPT-4 on Aider code editing benchmark
[![benchmark results](/assets/2024-03-07-claude-3.svg)](https://aider.chat/assets/2024-03-07-claude-3.svg)
The LLM gets two tries to solve each problem:
1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task, as sketched below.
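
A minimal sketch of that two-try flow, in Python. The `ask_llm`, `apply_edits`, and `run_tests` callables are hypothetical stand-ins for the real prompting, editing, and test machinery, not aider's actual internals:

```python
def attempt_exercise(stub_code, instructions, ask_llm, apply_edits, run_tests):
    """Sketch of the benchmark's two-try flow.

    ask_llm, apply_edits, and run_tests are hypothetical callables
    standing in for the real machinery; not aider's actual internals.
    """
    # First try: the LLM sees the stub code and the task description.
    code = apply_edits(stub_code, ask_llm(instructions, stub_code))
    passed, test_output = run_tests(code)
    if passed:
        return True
    # Second try: the LLM sees the failing test output and fixes the code.
    code = apply_edits(code, ask_llm(f"These tests failed:\n{test_output}", code))
    passed, _ = run_tests(code)
    return passed
```
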
## Benchmark results
- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
## Code editing
It's highly desirable to have the LLM send back code edits as
some form of diff, rather than returning an updated copy of the
entire source code.
Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back
updated copies of entire source files.
Aider uses more efficient
[search/replace blocks](https://aider.chat/2023/07/02/benchmarks.html#diff)
with the original GPT-4
and
[unified diffs](https://aider.chat/2023/12/21/unified-diffs.html#unified-diff-editing-format)
with the newer GPT-4 Turbo models.
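
To make the difference concrete, here is the same small edit expressed in both styles. These examples are illustrative only; the exact marker syntax is specified in the posts linked above. A search/replace block pairs the exact lines to find with their replacement:

```
demo.py
<<<<<<< SEARCH
def greet():
    print("hello")
=======
def greet(name):
    print(f"hello, {name}")
>>>>>>> REPLACE
```

A unified diff expresses the same change in standard patch notation:

```diff
--- a/demo.py
+++ b/demo.py
@@ -1,2 +1,2 @@
-def greet():
-    print("hello")
+def greet(name):
+    print(f"hello, {name}")
```

Either way, the model transmits only the lines that change, rather than the entire file.
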
Claude 3 Opus works best with the search/replace blocks, allowing it to send back
code changes efficiently.
Unfortunately, the Sonnet model was only able to work reliably with whole files,
which limits it to editing smaller source files and uses more tokens, money and time.
## Other observations
There are a few other things worth noting:
- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
- Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider automatically recovers from these errors with exponential backoff retries (sketched below), but it's a sign that Anthropic may be struggling under surging demand.
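
Exponential backoff just means waiting longer after each failure before retrying. A minimal generic sketch of the pattern, not aider's actual retry code:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Generic exponential backoff sketch; not aider's actual retry code."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in practice, catch only transient 5xx API errors
            if attempt == max_attempts - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter so retries don't synchronize.
            time.sleep(base_delay * 2 ** attempt + random.random())
```
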