Added claude post

Paul Gauthier 2024-03-08 08:00:41 -08:00
parent a1939f50e4
commit 7fbd9e2be4
4 changed files with 2123 additions and 7 deletions

@@ -0,0 +1,75 @@
---
title: Claude 3 beats all OpenAI models on Aider code editing benchmark
excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
highlight_image: /assets/2024-03-07-claude-3.svg
---
# Claude 3 beats GPT-4 on Aider code editing benchmark
[![benchmark results](/assets/2024-03-07-claude-3.svg)](https://aider.chat/assets/2024-03-07-claude-3.svg)
[Anthropic just released their new Claude 3 models]()
with evals showing better performance on coding tasks.
With that in mind, I've been benchmarking the new models
using Aider's code editing benchmark suite.
Claude 3 Opus outperforms all of OpenAI's models,
making it the best available model for pair programming with AI.
Aider currently supports Claude 3 Opus via
[OpenRouter](https://aider.chat/docs/faq.html#accessing-other-llms-with-openrouter):
```
# Install Aider
pip install aider-chat
# Setup openrouter access
export OPENAI_API_KEY=<your-openrouter-key>
export OPENAI_API_BASE=https://openrouter.ai/api/v1
# Run aider with Claude 3 Opus using the diff editing format
aider --model anthropic/claude-3-opus --edit-format diff
```
## Aider's code editing benchmark
[Aider](https://github.com/paul-gauthier/aider)
is an open source command line chat tool that lets you
pair program with AI on code in your local git repo.
Aider relies on a
[code editing benchmark](https://aider.chat/docs/benchmarks.html)
to quantitatively evaluate how well
an LLM can make changes to existing code.
The benchmark uses aider to try and complete
[133 Exercism Python coding exercises](https://github.com/exercism/python).
For each exercise,
Exercism provides a starting Python file with stubs for the needed functions,
a natural language description of the problem to solve,
and a test suite to evaluate whether the coder has correctly solved the problem.
The LLM gets two tries to solve each problem:
1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
2. If the tests fail, aider sends the LLM the failing test output and gives it a second try to complete the task, as sketched below.
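To make the two-step flow concrete, here is a minimal sketch of what a harness like this could look like. It is an illustration, not aider's actual benchmark code: the `ask_llm_to_edit` callable, the `instructions.md` filename, and the pytest invocation are all assumptions.
```
import subprocess
from pathlib import Path


def run_tests(exercise_dir: Path) -> tuple[bool, str]:
    """Run the exercise's test suite; return (passed, captured output)."""
    result = subprocess.run(
        ["pytest", str(exercise_dir)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def attempt_exercise(exercise_dir: Path, ask_llm_to_edit) -> int:
    """Give the LLM two tries at an exercise.

    Returns 1 if it passes on the first try, 2 if it passes on the
    second try, and 0 if it never passes.
    """
    # First try: the stub code plus the natural language description.
    instructions = (exercise_dir / "instructions.md").read_text()
    ask_llm_to_edit(exercise_dir, instructions)
    passed, output = run_tests(exercise_dir)
    if passed:
        return 1

    # Second try: feed the failing test output back to the LLM.
    ask_llm_to_edit(exercise_dir, f"The tests failed:\n{output}\nFix the code.")
    passed, _ = run_tests(exercise_dir)
    return 2 if passed else 0
```
In those terms, the first-try scores reported below are the fraction of the 133 exercises that pass on try 1, and the overall scores are the fraction that pass on either try.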
## Benchmark results
### Claude 3 Opus
- The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
### Claude 3 Sonnet
- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
## Other observations
There are a few other things worth noting:
- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song]() program, which at least makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider does exponential backoff retries in these cases (see the sketch below), but it's a sign that they may be struggling under surging demand.
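For reference, the retry pattern looks roughly like this. It is a generic illustration of exponential backoff, not aider's actual retry code; the `TransientServerError` exception and the specific delays are placeholders.
```
import random
import time


class TransientServerError(Exception):
    """Placeholder for an HTTP 5xx error raised by the API client."""


def with_exponential_backoff(call, max_retries=5):
    """Call `call()`, retrying transient server errors with growing delays."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return call()
        except TransientServerError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # Sleep with a little jitter, then double the delay.
            time.sleep(delay + random.uniform(0, 0.5))
            delay *= 2
```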