Paul Gauthier 2024-03-08 08:09:07 -08:00
parent 7fbd9e2be4
commit 573a6814b2
2 changed files with 12 additions and 10 deletions

View file

@@ -1,5 +1,5 @@
 ---
-title: Claude 3 beats all OpenAI models on Aider code editing benchmark
+title: Claude 3 beats GPT-4 on Aider code editing benchmark
 excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
 highlight_image: /assets/2024-03-07-claude-3.svg
 ---
@@ -7,7 +7,7 @@ highlight_image: /assets/2024-03-07-claude-3.svg
 [![benchmark results](/assets/2024-03-07-claude-3.svg)](https://aider.chat/assets/2024-03-07-claude-3.svg)
-[Anthropic just release their new Claude 3 models]()
+[Anthropic just released their new Claude 3 models](https://www.anthropic.com/news/claude-3-family)
 with evals showing better performance on coding tasks.
 With that in mind, I've been benchmarking the new models
 using Aider's code editing benchmark suite.
@@ -18,12 +18,12 @@ Aider currently supports Claude 3 Opus via
 [OpenRouter](https://aider.chat/docs/faq.html#accessing-other-llms-with-openrouter):
 ```
-# Install Aider
+# Install aider
 pip install aider-chat
-# Setup openrouter access
+# Setup OpenRouter access
 export OPENAI_API_KEY=<your-openrouter-key>
-export export OPENAI_API_BASE=https://openrouter.ai/api/v1
+export OPENAI_API_BASE=https://openrouter.ai/api/v1
 # Run aider with Claude 3 Opus using the diff editing format
 aider --model anthropic/claude-3-opus --edit-format diff
@@ -56,7 +56,8 @@ The LLM gets two tries to solve each problem:
 ### Claude 3 Opus
 - The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
-- It's single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
+- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
+- While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
 ### Claude 3 Sonnet
@@ -67,7 +68,8 @@ The LLM gets two tries to solve each problem:
 There are a few other things worth noting:
 - Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
-- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song]() program, which at makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
+- Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
+- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
 - The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider does exponential backoff retries in these cases, but it's a sign that they may be struggling under surging demand.
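
The last bullet above notes that aider retries transient Claude API 5xx errors with exponential backoff. As a rough illustration of that pattern only (not aider's actual implementation; the function name, retry count, and delay constants here are assumptions), a minimal Python sketch might look like this:

```
import random
import time

import requests


def post_with_backoff(url, payload, headers=None, max_retries=5, base_delay=1.0):
    """Retry transient HTTP 5xx responses with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers)
        if resp.status_code < 500:
            return resp  # success, or a 4xx error that retrying won't fix
        # Wait roughly 1s, 2s, 4s, ... with a little random jitter before retrying
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    resp.raise_for_status()  # still failing after max_retries: surface the error
```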

View file

@@ -79,12 +79,12 @@ which contains many benchmarking articles.
 To access the OpenRouter models, simply:
 ```
-# Install Aider
+# Install aider
 pip install aider-chat
-# Setup openrouter access
+# Setup OpenRouter access
 export OPENAI_API_KEY=<your-openrouter-key>
-export export OPENAI_API_BASE=https://openrouter.ai/api/v1
+export OPENAI_API_BASE=https://openrouter.ai/api/v1
 # For example, run aider with Claude 3 Opus using the diff editing format
 aider --model anthropic/claude-3-opus --edit-format diff