From 7fbd9e2be423283080c0ac8dbdc80ebe8a508ed4 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Fri, 8 Mar 2024 08:00:41 -0800
Subject: [PATCH] Added claude post

---
 _posts/2024-03-08-claude-3.md  |   75 ++
 aider/coders/base_coder.py     |    4 +
 assets/2024-03-07-claude-3.svg | 2031 ++++++++++++++++++++++++++++++++
 docs/faq.md                    |   20 +-
 4 files changed, 2123 insertions(+), 7 deletions(-)
 create mode 100644 _posts/2024-03-08-claude-3.md
 create mode 100644 assets/2024-03-07-claude-3.svg

diff --git a/_posts/2024-03-08-claude-3.md b/_posts/2024-03-08-claude-3.md
new file mode 100644
index 000000000..9bcc7cd2e
--- /dev/null
+++ b/_posts/2024-03-08-claude-3.md
@@ -0,0 +1,75 @@
+---
+title: Claude 3 beats all OpenAI models on Aider code editing benchmark
+excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
+highlight_image: /assets/2024-03-07-claude-3.svg
+---
+# Claude 3 beats GPT-4 on Aider code editing benchmark
+
+[![benchmark results](/assets/2024-03-07-claude-3.svg)](https://aider.chat/assets/2024-03-07-claude-3.svg)
+
+[Anthropic just released their new Claude 3 models]()
+with evals showing better performance on coding tasks.
+With that in mind, I've been benchmarking the new models
+using Aider's code editing benchmark suite.
+Claude 3 Opus outperforms all of OpenAI's models,
+making it the best available model for pair programming with AI.
+
+Aider currently supports Claude 3 Opus via
+[OpenRouter](https://aider.chat/docs/faq.html#accessing-other-llms-with-openrouter):
+
+```
+# Install Aider
+pip install aider-chat
+
+# Set up OpenRouter access
+export OPENAI_API_KEY=
+export OPENAI_API_BASE=https://openrouter.ai/api/v1
+
+# Run aider with Claude 3 Opus using the diff editing format
+aider --model anthropic/claude-3-opus --edit-format diff
+```
+
+## Aider's code editing benchmark
+
+[Aider](https://github.com/paul-gauthier/aider)
+is an open source command line chat tool that lets you
+pair program with AI on code in your local git repo.
+
+Aider relies on a
+[code editing benchmark](https://aider.chat/docs/benchmarks.html)
+to quantitatively evaluate how well
+an LLM can make changes to existing code.
+The benchmark uses aider to try to complete
+[133 Exercism Python coding exercises](https://github.com/exercism/python).
+For each exercise,
+Exercism provides a starting Python file with stubs for the needed functions,
+a natural language description of the problem to solve
+and a test suite to evaluate whether the coder has correctly solved the problem.
+
+The LLM gets two tries to solve each problem:
+
+1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
+2. If the tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
+
+## Benchmark results
+
+### Claude 3 Opus
+
+- The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
+- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
+
+### Claude 3 Sonnet
+
+- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
+
+## Other observations
+
+There are a few other things worth noting:
+
+- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
+- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song]() program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
+- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider does exponential backoff retries in these cases, but it's a sign that they may be struggling under surging demand.
+
+
+
+
diff --git a/aider/coders/base_coder.py b/aider/coders/base_coder.py
index 8834822b8..60067f8d7 100755
--- a/aider/coders/base_coder.py
+++ b/aider/coders/base_coder.py
@@ -682,6 +682,10 @@ class Coder:
         if self.verbose:
             print(completion)
 
+        if not completion.choices:
+            self.io.tool_error(str(completion))
+            return
+
         show_func_err = None
         show_content_err = None
         try:
diff --git a/assets/2024-03-07-claude-3.svg b/assets/2024-03-07-claude-3.svg
new file mode 100644
index 000000000..298b6b322
--- /dev/null
+++ b/assets/2024-03-07-claude-3.svg
@@ -0,0 +1,2031 @@
[2031 added lines of SVG markup not shown: benchmark results chart generated by Matplotlib v3.8.2 (https://matplotlib.org/) on 2024-03-07T12:50:41.385323, image/svg+xml]
diff --git a/docs/faq.md b/docs/faq.md
index 831e01704..d6a52f127 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -74,16 +74,22 @@ which contains many benchmarking articles.
 
 ## Accessing other LLMs with OpenRouter
 
-[OpenRouter](https://openrouter.ai) provide an interface to [many models](https://openrouter.ai/docs) which are not widely accessible, in particular gpt-4-32k and claude-2.
+[OpenRouter](https://openrouter.ai) provides an interface to [many models](https://openrouter.ai/models) which are not widely accessible, in particular Claude 3 Opus.
 
-To access the openrouter models simply
+To access the OpenRouter models, simply:
 
-- register for an account, purchase some credits and generate an api key
-- set `--openai-api-base https://openrouter.ai/api/v1`
-- set `--openai-api-key` to your openrouter key
-- set `--model` to the model of your choice (`openai/gpt-4-32k`, `anthropic/claude-2` etc.)
+```
+# Install Aider
+pip install aider-chat
+
+# Set up OpenRouter access
+export OPENAI_API_KEY=
+export OPENAI_API_BASE=https://openrouter.ai/api/v1
+
+# For example, run aider with Claude 3 Opus using the diff editing format
+aider --model anthropic/claude-3-opus --edit-format diff
+```
 
-Some of the models weren't very functional and each llm has its own quirks. The anthropic models work ok, but the llama-2 ones in particular will need more work to play friendly with aider.
 
 ## Can I use aider with other LLMs, local LLMs, etc?