From ae768054b5c056d7e8efd99b3c7b6de35e14aeba Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Fri, 8 Mar 2024 08:20:25 -0800
Subject: [PATCH] copy

---
 _posts/2024-03-08-claude-3.md | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/_posts/2024-03-08-claude-3.md b/_posts/2024-03-08-claude-3.md
index a1c0f29f3..5159c369f 100644
--- a/_posts/2024-03-08-claude-3.md
+++ b/_posts/2024-03-08-claude-3.md
@@ -1,9 +1,9 @@
 ---
-title: Claude 3 beats GPT-4 on Aider code editing benchmark
+title: Claude 3 Opus beats GPT-4 on Aider code editing benchmark
 excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
 highlight_image: /assets/2024-03-07-claude-3.svg
 ---
-# Claude 3 beats GPT-4 on Aider code editing benchmark
+# Claude 3 Opus beats GPT-4 on Aider code editing benchmark
 
 [![benchmark results](/assets/2024-03-07-claude-3.svg)](https://aider.chat/assets/2024-03-07-claude-3.svg)
 
@@ -49,7 +49,7 @@ and a test suite to evaluate whether the coder has correctly solved the problem.
 The LLM gets two tries to solve each problem:
 
 1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
-2. If the tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
+2. If any tests fail, aider sends the LLM the failing test output and gives it a second try to complete the task.
 
 ## Benchmark results
 
@@ -63,15 +63,32 @@ The LLM gets two tries to solve each problem:
 - The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
 
+## Code editing
+
+It's highly desirable to have the LLM send back code edits as
+some form of diffs, rather than having it send back an updated copy of the
+entire source code.
+
+Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back
+updated copies of entire source files.
+Aider uses more efficient
+[search/replace blocks](https://aider.chat/2023/07/02/benchmarks.html#diff)
+with the original GPT-4
+and
+[unified diffs](https://aider.chat/2023/12/21/unified-diffs.html#unified-diff-editing-format)
+with the newer GPT-4 Turbo models.
+
+Claude 3 Opus works best with the search/replace blocks, allowing it to send back
+code changes efficiently.
+Unfortunately, the Sonnet model was only able to work reliably with whole files,
+which limits it to editing smaller source files and uses more tokens, money and time.
+
 ## Other observations
 
 There are a few other things worth noting:
 
 - Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
 - Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
-- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which at makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
-- The Claude API's seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider does exponential backoff retries in these cases, but it's a sign that they made be struggling under surging demand.
-
-
-
+- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
+- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider automatically recovers from these errors with exponential backoff retries, but it's a sign that Anthropic may be struggling under surging demand.