Mirror of https://github.com/Aider-AI/aider.git, synced 2025-06-10 06:34:59 +00:00

commit 22a494bb59 (parent eb80b32915)
moved website/ -> aider/website/

155 changed files with 9 additions and 9 deletions
aider/website/_posts/2023-05-25-ctags.md (Symbolic link, 1 line)
@@ -0,0 +1 @@
../docs/ctags.md
aider/website/_posts/2023-07-02-benchmarks.md (Symbolic link, 1 line)
@@ -0,0 +1 @@
../docs/benchmarks.md
aider/website/_posts/2023-10-22-repomap.md (Normal file, 268 lines)
@@ -0,0 +1,268 @@
---
|
||||
title: Building a better repository map with tree sitter
|
||||
excerpt: Tree-sitter allows aider to build a repo map that better summarizes large code bases.
|
||||
highlight_image: /assets/robot-ast.png
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Building a better repository map with tree sitter
|
||||
|
||||

|
||||
|
||||
GPT-4 is extremely useful for "self-contained" coding tasks,
|
||||
like generating or modifying a simple function
|
||||
that has no dependencies. Tools like GitHub Copilot serve
|
||||
these simple coding tasks well.
|
||||
|
||||
But making complex changes in a larger, pre-existing codebase
|
||||
is much more difficult, for both humans and AIs.
|
||||
To do this successfully, you need to:
|
||||
|
||||
1. Find the code that needs to be changed.
|
||||
2. Understand how that code relates to the rest of the codebase.
|
||||
3. Make the correct code change to accomplish the task.
|
||||
|
||||
GPT-4 is actually great at making the code changes (3),
|
||||
once you tell it which files need to be changed (1)
|
||||
and show it how they fit into the rest of the codebase (2).
|
||||
|
||||
This article is going to focus on step (2), providing "code context":
|
||||
|
||||
- We need to help GPT understand the overall codebase.
|
||||
- This will help it understand the code it needs to change, which may depend on other parts of the codebase.
|
||||
- It will also help GPT write new code and modify the existing code in a way
|
||||
that respects and utilizes existing libraries, modules and abstractions
|
||||
found elsewhere in the codebase.
|
||||
- We must convey all of this "code context" to GPT in an
|
||||
efficient manner that fits within the limited context window.
|
||||
|
||||
To address these issues, aider
|
||||
sends GPT a **concise map of your whole git repository**
|
||||
that includes
|
||||
the most important classes and functions along with their types and call signatures.
|
||||
|
||||
This **repository map** is now built automatically using
|
||||
[tree-sitter](https://tree-sitter.github.io/tree-sitter/)
|
||||
to extract symbol definitions from source files.
|
||||
Tree-sitter is used by many IDEs, editors and LSP servers to
|
||||
help humans search and navigate large codebases.
|
||||
Aider now uses it to help GPT better comprehend, navigate
|
||||
and edit code in larger repos.
|
||||
|
||||
*To code with GPT-4 using the techniques discussed here, just install [aider](https://aider.chat/docs/install.html).*
|
||||
|
||||
|
||||
## The problem: code context
|
||||
|
||||
GPT-4 is great at "self-contained" coding tasks, like writing or
|
||||
modifying a pure function with no external dependencies.
|
||||
GPT can easily handle requests like "write a
|
||||
Fibonacci function" or "rewrite this loop using list
|
||||
comprehensions", because they require no context beyond the code
|
||||
being discussed.
|
||||
|
||||
Most real code is not pure and self-contained; it is intertwined with
|
||||
and depends on code from many different files in a repo.
|
||||
If you ask GPT to "switch all the print statements in class Foo to
|
||||
use the BarLog logging system", it needs to see and
|
||||
modify the code in the Foo class, but it also needs to understand
|
||||
how to use
|
||||
the project's BarLog
|
||||
subsystem.
|
||||
|
||||
A simple solution is to **send the entire codebase** to GPT along with
|
||||
each change request. Now GPT has all the context! But this won't work
|
||||
for even moderately
|
||||
sized repos, because they won't fit into the context window.
|
||||
|
||||
A better approach is to be selective,
|
||||
and **hand pick which files to send**.
|
||||
For the example above, you could send the file that
|
||||
contains the Foo class
|
||||
and the file that contains the BarLog logging subsystem.
|
||||
This works pretty well, and is supported by aider -- you
|
||||
can manually "add files to the chat" you are having with GPT.
|
||||
|
||||
But sending whole files is a bulky way to send code context,
|
||||
wasting the precious context window.
|
||||
GPT doesn't need to see the entire implementation of BarLog,
|
||||
it just needs to understand it well enough to use it.
|
||||
You may quickly run out of context window by sending
|
||||
full files of code
|
||||
just to convey context.
|
||||
|
||||
Aider also strives to reduce the manual work involved in
|
||||
coding with AI.
|
||||
So in an ideal world, we'd like aider to automatically
|
||||
identify and provide the needed code context.
|
||||
|
||||
## Using a repo map to provide context
|
||||
|
||||
Aider sends a **repo map** to GPT along with
|
||||
each request from the user to make a code change.
|
||||
The map contains a list of the files in the
|
||||
repo, along with the key symbols which are defined in each file.
|
||||
It shows how each of these symbols is defined in the
|
||||
source code, by including the critical lines of code for each definition.
|
||||
|
||||
Here's a
|
||||
sample of the map of the aider repo, just showing the maps of
|
||||
[base_coder.py](https://github.com/paul-gauthier/aider/blob/main/aider/coders/base_coder.py)
|
||||
and
|
||||
[commands.py](https://github.com/paul-gauthier/aider/blob/main/aider/commands.py):
|
||||
|
||||
```
|
||||
aider/coders/base_coder.py:
|
||||
⋮...
|
||||
│class Coder:
|
||||
│ abs_fnames = None
|
||||
⋮...
|
||||
│ @classmethod
|
||||
│ def create(
|
||||
│ self,
|
||||
│ main_model,
|
||||
│ edit_format,
|
||||
│ io,
|
||||
│ skip_model_availabily_check=False,
|
||||
│ **kwargs,
|
||||
⋮...
|
||||
│ def abs_root_path(self, path):
|
||||
⋮...
|
||||
│ def run(self, with_message=None):
|
||||
⋮...
|
||||
|
||||
aider/commands.py:
|
||||
⋮...
|
||||
│class Commands:
|
||||
│ voice = None
|
||||
│
|
||||
⋮...
|
||||
│ def get_commands(self):
|
||||
⋮...
|
||||
│ def get_command_completions(self, cmd_name, partial):
|
||||
⋮...
|
||||
│ def run(self, inp):
|
||||
⋮...
|
||||
```
|
||||
|
||||
Mapping out the repo like this provides some key benefits:
|
||||
|
||||
- GPT can see classes, methods and function signatures from everywhere in the repo. This alone may give it enough context to solve many tasks. For example, it can probably figure out how to use the API exported from a module just based on the details shown in the map.
|
||||
- If it needs to see more code, GPT can use the map to figure out by itself which files it needs to look at in more detail. GPT will then ask to see these specific files, and aider will automatically add them to the chat context.
|
||||
|
||||
## Optimizing the map
|
||||
|
||||
Of course, for large repositories even just the repo map might be too large
|
||||
for GPT's context window.
|
||||
Aider solves this problem by sending just the **most relevant**
|
||||
portions of the repo map.
|
||||
It does this by analyzing the full repo map using
|
||||
a graph ranking algorithm, computed on a graph
|
||||
where each source file is a node and edges connect
|
||||
files which have dependencies.
|
||||
Aider optimizes the repo map by
|
||||
selecting the most important parts of the codebase
|
||||
which will
|
||||
fit into the token budget assigned by the user
|
||||
(via the `--map-tokens` switch, which defaults to 1k tokens).
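
Here's a toy sketch of that ranking idea. This is only an illustration, not aider's actual implementation; the `references` input format and the use of networkx PageRank are assumptions for the sake of the example:

```python
# Toy sketch of graph ranking for a repo map; not aider's actual code.
import networkx as nx

def rank_files(references):
    # references: {"commands.py": ["base_coder.py", "io.py"], ...}
    # meaning commands.py uses symbols defined in base_coder.py and io.py
    graph = nx.DiGraph()
    for src, deps in references.items():
        for dep in deps:
            graph.add_edge(src, dep)
    ranks = nx.pagerank(graph)
    # Highest ranked files get the most space in the map,
    # until the --map-tokens budget is exhausted.
    return sorted(ranks, key=ranks.get, reverse=True)
```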
|
||||
|
||||
The sample map shown above doesn't contain *every* class, method and function from those
|
||||
files.
|
||||
It only includes the most important identifiers,
|
||||
the ones which are most often referenced by other portions of the code.
|
||||
These are the key pieces of context that GPT needs to know to understand
|
||||
the overall codebase.
|
||||
|
||||
|
||||
## Using tree-sitter to make the map
|
||||
|
||||
Under the hood, aider uses
|
||||
[tree sitter](https://tree-sitter.github.io/tree-sitter/)
|
||||
to build the
|
||||
map.
|
||||
It specifically uses the
|
||||
[py-tree-sitter-languages](https://github.com/grantjenks/py-tree-sitter-languages)
|
||||
python module,
|
||||
which provides simple, pip-installable binary wheels for
|
||||
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
|
||||
|
||||
Tree-sitter parses source code into an Abstract Syntax Tree (AST) based
|
||||
on the syntax of the programming language.
|
||||
Using the AST, we can identify where functions, classes, variables, types and
|
||||
other definitions occur in the source code.
|
||||
We can also identify where else in the code these things are used or referenced.
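
As a rough illustration of that step, here's how you might list the function and class definitions in a Python file using the `tree_sitter_languages` module from the package mentioned below. It's a sketch under those assumptions, not aider's actual code:

```python
# Sketch: extract definition names and line numbers from a Python file.
from tree_sitter_languages import get_parser

def list_definitions(path):
    source = open(path, "rb").read()
    tree = get_parser("python").parse(source)

    defs = []
    def walk(node):
        if node.type in ("function_definition", "class_definition"):
            name = node.child_by_field_name("name")
            if name is not None:
                ident = source[name.start_byte:name.end_byte].decode()
                defs.append((node.start_point[0] + 1, ident))
        for child in node.children:
            walk(child)

    walk(tree.root_node)
    return defs  # [(line_number, identifier), ...]
```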
|
||||
|
||||
Aider uses all of these definitions and references to
|
||||
determine which are the most important identifiers in the repository,
|
||||
and to produce the repo map that shows just those key
|
||||
lines from the codebase.
|
||||
|
||||
## What about ctags?
|
||||
|
||||
The tree-sitter repository map replaces the
|
||||
[ctags based map](https://aider.chat/docs/ctags.html)
|
||||
that aider originally used.
|
||||
Switching from ctags to tree-sitter provides a bunch of benefits:
|
||||
|
||||
- The map is richer, showing full function call signatures and other details straight from the source files.
|
||||
- Thanks to `py-tree-sitter-languages`, we get full support for many programming languages via a python package that's automatically installed as part of the normal `pip install aider-chat`.
|
||||
- We remove the requirement for users to manually install `universal-ctags` via some external tool or package manager (brew, apt, choco, etc).
|
||||
- Tree-sitter integration is a key enabler for future work and capabilities for aider.
|
||||
|
||||
## Future work
|
||||
|
||||
You'll recall that we identified the 3 key steps
|
||||
required to use GPT
|
||||
to complete a coding task within a large, pre-existing codebase:
|
||||
|
||||
1. Find the code that needs to be changed.
|
||||
2. Understand how that code relates to the rest of the codebase.
|
||||
3. Make the correct code change to accomplish the task.
|
||||
|
||||
We're now using tree-sitter to help solve the code context problem (2),
|
||||
but it's also an important foundation
|
||||
for future work on automatically finding all the code which
|
||||
will need to be changed (1).
|
||||
|
||||
Right now, aider relies on the user to specify which source files
|
||||
will need to be modified to complete their request.
|
||||
Users manually "add files to the chat" using aider's `/add` command,
|
||||
which makes those files available for GPT to modify.
|
||||
|
||||
This works well, but a key piece of future work is to harness the
|
||||
power of GPT and tree-sitter to automatically identify
|
||||
which parts of the code will need changes.
|
||||
|
||||
## Try it out
|
||||
|
||||
To code with GPT-4 using the techniques discussed here,
|
||||
just install [aider](https://aider.chat/docs/install.html).
|
||||
|
||||
## Credits
|
||||
|
||||
Aider uses
|
||||
[modified versions of the tags.scm files](https://github.com/paul-gauthier/aider/tree/main/aider/queries)
|
||||
from these
|
||||
open source tree-sitter language implementations:
|
||||
|
||||
* [https://github.com/tree-sitter/tree-sitter-c](https://github.com/tree-sitter/tree-sitter-c) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-c-sharp](https://github.com/tree-sitter/tree-sitter-c-sharp) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-cpp](https://github.com/tree-sitter/tree-sitter-cpp) — licensed under the MIT License.
|
||||
* [https://github.com/Wilfred/tree-sitter-elisp](https://github.com/Wilfred/tree-sitter-elisp) — licensed under the MIT License.
|
||||
* [https://github.com/elixir-lang/tree-sitter-elixir](https://github.com/elixir-lang/tree-sitter-elixir) — licensed under the Apache License, Version 2.0.
|
||||
* [https://github.com/elm-tooling/tree-sitter-elm](https://github.com/elm-tooling/tree-sitter-elm) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-go](https://github.com/tree-sitter/tree-sitter-go) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-java](https://github.com/tree-sitter/tree-sitter-java) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-javascript](https://github.com/tree-sitter/tree-sitter-javascript) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-ocaml](https://github.com/tree-sitter/tree-sitter-ocaml) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-php](https://github.com/tree-sitter/tree-sitter-php) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-python](https://github.com/tree-sitter/tree-sitter-python) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-ql](https://github.com/tree-sitter/tree-sitter-ql) — licensed under the MIT License.
|
||||
* [https://github.com/r-lib/tree-sitter-r](https://github.com/r-lib/tree-sitter-r) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-ruby](https://github.com/tree-sitter/tree-sitter-ruby) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-rust](https://github.com/tree-sitter/tree-sitter-rust) — licensed under the MIT License.
|
||||
* [https://github.com/tree-sitter/tree-sitter-typescript](https://github.com/tree-sitter/tree-sitter-typescript) — licensed under the MIT License.
|
aider/website/_posts/2023-11-06-benchmarks-1106.md (Symbolic link, 1 line)
@@ -0,0 +1 @@
../docs/benchmarks-1106.md
aider/website/_posts/2023-11-06-benchmarks-speed-1106.md (Symbolic link, 1 line)
@@ -0,0 +1 @@
../docs/benchmarks-speed-1106.md
aider/website/_posts/2023-12-21-unified-diffs.md (Symbolic link, 1 line)
@@ -0,0 +1 @@
../docs/unified-diffs.md
aider/website/_posts/2024-01-25-benchmarks-0125.md (Symbolic link, 1 line)
@@ -0,0 +1 @@
../docs/benchmarks-0125.md
aider/website/_posts/2024-03-08-claude-3.md (Normal file, 93 lines)
@@ -0,0 +1,93 @@
---
|
||||
title: Claude 3 beats GPT-4 on Aider's code editing benchmark
|
||||
excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
|
||||
highlight_image: /assets/2024-03-07-claude-3.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Claude 3 beats GPT-4 on Aider's code editing benchmark
|
||||
|
||||
[](https://aider.chat/assets/2024-03-07-claude-3.svg)
|
||||
|
||||
[Anthropic just released their new Claude 3 models](https://www.anthropic.com/news/claude-3-family)
|
||||
with evals showing better performance on coding tasks.
|
||||
With that in mind, I've been benchmarking the new models
|
||||
using Aider's code editing benchmark suite.
|
||||
|
||||
Claude 3 Opus outperforms all of OpenAI's models,
|
||||
making it the best available model for pair programming with AI.
|
||||
|
||||
To use Claude 3 Opus with aider:
|
||||
|
||||
```
|
||||
pip install aider-chat
|
||||
export ANTHROPIC_API_KEY=sk-...
|
||||
aider --opus
|
||||
```
|
||||
|
||||
## Aider's code editing benchmark
|
||||
|
||||
[Aider](https://github.com/paul-gauthier/aider)
|
||||
is an open source command line chat tool that lets you
|
||||
pair program with AI on code in your local git repo.
|
||||
|
||||
Aider relies on a
|
||||
[code editing benchmark](https://aider.chat/docs/benchmarks.html)
|
||||
to quantitatively evaluate how well
|
||||
an LLM can make changes to existing code.
|
||||
The benchmark uses aider to try and complete
|
||||
[133 Exercism Python coding exercises](https://github.com/exercism/python).
|
||||
For each exercise,
|
||||
Exercism provides a starting python file with stubs for the needed functions,
|
||||
a natural language description of the problem to solve
|
||||
and a test suite to evaluate whether the coder has correctly solved the problem.
|
||||
|
||||
The LLM gets two tries to solve each problem:
|
||||
|
||||
1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
|
||||
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
|
||||
|
||||
## Benchmark results
|
||||
|
||||
### Claude 3 Opus
|
||||
|
||||
- The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
|
||||
- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
|
||||
- While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
|
||||
|
||||
### Claude 3 Sonnet
|
||||
|
||||
- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
|
||||
|
||||
## Code editing
|
||||
|
||||
It's highly desirable to have the LLM send back code edits as
|
||||
some form of diffs, rather than having it send back an updated copy of the
|
||||
entire source code.
|
||||
|
||||
Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back
|
||||
updated copies of entire source files.
|
||||
Aider uses more efficient
|
||||
[search/replace blocks](https://aider.chat/2023/07/02/benchmarks.html#diff)
|
||||
with the original GPT-4
|
||||
and
|
||||
[unified diffs](https://aider.chat/2023/12/21/unified-diffs.html#unified-diff-editing-format)
|
||||
with the newer GPT-4 Turbo models.
|
||||
|
||||
Claude 3 Opus works best with the search/replace blocks, allowing it to send back
|
||||
code changes efficiently.
|
||||
Unfortunately, the Sonnet model was only able to work reliably with whole files,
|
||||
which limits it to editing smaller source files and uses more tokens, money and time.
|
||||
|
||||
## Other observations
|
||||
|
||||
There are a few other things worth noting:
|
||||
|
||||
- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
|
||||
- Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
|
||||
- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
|
||||
- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider automatically recovers from these errors with exponential backoff retries, but it's a sign that Anthropic may be struggling under surging demand.
|
||||
|
aider/website/_posts/2024-04-09-gpt-4-turbo.md (Normal file, 74 lines)
@@ -0,0 +1,74 @@
---
|
||||
title: GPT-4 Turbo with Vision is a step backwards for coding
|
||||
excerpt: OpenAI's GPT-4 Turbo with Vision model scores worse on aider's code editing benchmarks than all the previous GPT-4 models. In particular, it seems much more prone to "lazy coding" than the existing GPT-4 Turbo "preview" models.
|
||||
highlight_image: /assets/2024-04-09-gpt-4-turbo-laziness.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# GPT-4 Turbo with Vision is a step backwards for coding
|
||||
|
||||
[OpenAI just released GPT-4 Turbo with Vision](https://twitter.com/OpenAIDevs/status/1777769463258988634)
|
||||
and it performs worse on aider's coding benchmark suites than all the previous GPT-4 models.
|
||||
In particular, it seems much more prone to "lazy coding" than the
|
||||
existing GPT-4 Turbo "preview" models.
|
||||
|
||||
## Code editing skill
|
||||
|
||||
[](https://aider.chat/assets/2024-04-09-gpt-4-turbo.svg)
|
||||
|
||||
Aider relies on a
|
||||
[code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark)
|
||||
to quantitatively evaluate how well
|
||||
an LLM can make changes to existing code.
|
||||
The benchmark uses aider to try and complete
|
||||
[133 Exercism Python coding exercises](https://github.com/exercism/python).
|
||||
|
||||
For each exercise, the LLM gets two tries to solve each problem:
|
||||
|
||||
1. On the first try, it gets initial stub code and the English description of the coding task. If the tests all pass, we are done.
|
||||
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
|
||||
|
||||
**GPT-4 Turbo with Vision
|
||||
scores only 62% on this benchmark,
|
||||
the lowest score of any of the existing GPT-4 models.**
|
||||
The other models scored 63-66%, so this represents only a small
|
||||
regression, and is likely statistically insignificant when compared
|
||||
against `gpt-4-0613`.
|
||||
|
||||
## Lazy coding
|
||||
|
||||
[](https://aider.chat/assets/2024-04-09-gpt-4-turbo-laziness.svg)
|
||||
|
||||
The GPT-4 Turbo "preview" models have been widely criticized for being "lazy"
|
||||
when coding.
|
||||
They often omit needed code
|
||||
and instead leave comments with homework assignments like "implement method here".
|
||||
|
||||
```
|
||||
def some_complex_method(foo, bar):
|
||||
# ... implement method here ...
|
||||
```
|
||||
|
||||
Aider uses a ["laziness" benchmark suite](https://github.com/paul-gauthier/refactor-benchmark)
|
||||
which is designed to both provoke and quantify lazy coding.
|
||||
It consists of
|
||||
89 python refactoring tasks
|
||||
which tend to make GPT-4 Turbo code in that lazy manner.
|
||||
|
||||
**The new GPT-4 Turbo with Vision model scores only 34% on aider's
|
||||
refactoring benchmark, making it the laziest coder of all the GPT-4 Turbo models
|
||||
by a significant margin.**
|
||||
|
||||
# Conclusions
|
||||
|
||||
Aider has full support for the new GPT-4 Turbo with Vision
|
||||
model, which you can access using the switch `--model gpt-4-turbo-2024-04-09`.
|
||||
But aider will continue to use `gpt-4-1106-preview` by default,
|
||||
as it is by far the strongest coder of the GPT-4 models.
|
||||
|
||||
|
||||
|
||||
|
aider/website/_posts/2024-05-02-browser.md (Normal file, 55 lines)
@@ -0,0 +1,55 @@
---
|
||||
title: Aider in your browser
|
||||
excerpt: Aider has an experimental browser UI, allowing you to collaborate with LLMs on code in your local git repo.
|
||||
highlight_image: /assets/browser.jpg
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Aider in your browser
|
||||
|
||||
<div class="video-container">
|
||||
<video controls loop poster="/assets/browser.jpg">
|
||||
<source src="/assets/aider-browser-social.mp4" type="video/mp4">
|
||||
<a href="/assets/aider-browser-social.mp4">Aider browser UI demo video</a>
|
||||
</video>
|
||||
</div>
|
||||
|
||||
<style>
|
||||
.video-container {
|
||||
position: relative;
|
||||
padding-bottom: 101.89%; /* 1080 / 1060 = 1.0189 */
|
||||
height: 0;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
.video-container video {
|
||||
position: absolute;
|
||||
top: 0;
|
||||
left: 0;
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
}
|
||||
</style>
|
||||
|
||||
Use aider's new experimental browser UI to collaborate with LLMs
|
||||
to edit code in your local git repo.
|
||||
Aider will directly edit the code in your local source files,
|
||||
and [git commit the changes](https://aider.chat/docs/git.html)
|
||||
with sensible commit messages.
|
||||
You can start a new project or work with an existing git repo.
|
||||
Aider works well with GPT-3.5, GPT-4, GPT-4 Turbo with Vision,
|
||||
and Claude 3 Opus.
|
||||
It also supports [connecting to almost any LLM](https://aider.chat/docs/llms.html).
|
||||
|
||||
Use the `--browser` switch to launch the browser version of aider:
|
||||
|
||||
```
|
||||
pip install aider-chat
|
||||
|
||||
export OPENAI_API_KEY=<key> # Mac/Linux
|
||||
setx OPENAI_API_KEY <key> # Windows
|
||||
|
||||
aider --browser
|
||||
```
|
aider/website/_posts/2024-05-13-models-over-time.md (Normal file, 327 lines)
@@ -0,0 +1,327 @@
---
|
||||
title: Drawing graphs with aider, GPT-4o and matplotlib
|
||||
excerpt: Use GPT-4o to draw graphs with matplotlib, including adjusting styles and making visual changes. You get the graph, but you also get the code in your repo.
|
||||
highlight_image: /assets/models-over-time.png
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||

|
||||
|
||||
# Drawing graphs with aider, GPT-4o and matplotlib
|
||||
|
||||
I recently wanted to draw a graph showing how LLM code editing skill has been
|
||||
changing over time as new models have been released by OpenAI, Anthropic and others.
|
||||
I have all the
|
||||
[data in a yaml file](https://github.com/paul-gauthier/aider/blob/main/website/_data/edit_leaderboard.yml) that is used to render
|
||||
[aider's LLM leaderboards](https://aider.chat/docs/leaderboards/).
|
||||
|
||||
Below is the aider chat transcript, which shows:
|
||||
|
||||
- I launch aider with the yaml file, a file with other plots I've done recently (so GPT can crib the style) and an empty file called `over_time.py`.
|
||||
- Then I ask GPT to draw the scatterplot I want.
|
||||
- I run the resulting script and share the error output with GPT so it can fix a small bug.
|
||||
- I ask it to color the points for GPT-4 and GPT-3.5 family models differently, to better see trends within those model families.
|
||||
- I add a PNG of the graph to the chat, so GPT can see the current graph and understand why I want it to move the point labels so they don't overlap the points.
|
||||
- I work through a series of other small style changes, like changing fonts and the graph border.
|
||||
|
||||
In the end I have the graph, but I also have the python code in my repo.
|
||||
So I can update this graph easily whenever I add new entries to the yaml data file.
|
||||
|
||||
|
||||
## Aider chat transcript
|
||||
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
> Aider v0.35.1-dev
|
||||
> Models: gpt-4o with diff edit format, weak model gpt-3.5-turbo
|
||||
> Added _data/edit_leaderboard.yml to the chat.
|
||||
> Added benchmark/over_time.py to the chat.
|
||||
> Added benchmark/plots.py to the chat.
|
||||
|
||||
#### in over_time.py, draw a graph for each of the yaml entries with a `released` date. plot date on x, `pass_rate_2` on y and label with the `model` name. do a scatter plot. use colors/styles similar to the other graphs.
|
||||
|
||||
To draw a scatter plot for each of the YAML entries with a `released` date, we need to:
|
||||
|
||||
1. Read the YAML file.
|
||||
2. Extract the `released` date and `pass_rate_2` for each entry.
|
||||
3. Plot the data using matplotlib.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
=======
|
||||
import matplotlib.pyplot as plt
|
||||
import yaml
|
||||
from datetime import datetime
|
||||
|
||||
def plot_over_time(yaml_file):
|
||||
with open(yaml_file, 'r') as file:
|
||||
data = yaml.safe_load(file)
|
||||
|
||||
dates = []
|
||||
pass_rates = []
|
||||
models = []
|
||||
|
||||
for entry in data:
|
||||
if 'released' in entry and 'pass_rate_2' in entry:
|
||||
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
pass_rates.append(entry['pass_rate_2'])
|
||||
models.append(entry['model'])
|
||||
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
|
||||
|
||||
for i, model in enumerate(models):
|
||||
plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
|
||||
plt.xlabel('Release Date')
|
||||
plt.ylabel('Pass Rate 2')
|
||||
plt.title('Model Performance Over Time')
|
||||
plt.grid(True)
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
|
||||
# Example usage
|
||||
plot_over_time('_data/edit_leaderboard.yml')
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 49ac9ca aider: Added scatter plot functionality for YAML entries with a `released` date and `pass_rate_2`.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### /run python benchmark/over_time.py
|
||||
> Traceback (most recent call last):
|
||||
File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
|
||||
plot_over_time('_data/edit_leaderboard.yml')
|
||||
File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
|
||||
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
TypeError: strptime() argument 1 must be str, not datetime.date
|
||||
> Add the output to the chat? y
|
||||
> Traceback (most recent call last):
|
||||
> File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
|
||||
> plot_over_time('_data/edit_leaderboard.yml')
|
||||
> File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
|
||||
> dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
> TypeError: strptime() argument 1 must be str, not datetime.date
|
||||
|
||||
The error indicates that the `released` field in the YAML file is already a `datetime.date` object, so we don't need to parse it with `datetime.strptime`. We can directly use it in the plot.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
=======
|
||||
dates.append(entry['released'])
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit af2f4b7 aider: Use the `released` field directly in the plot instead of parsing it with `datetime.strptime`.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### color the "gpt-4" models differently
|
||||
|
||||
To color the "gpt-4" models differently, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-4" in the model name and assign a different color.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
|
||||
=======
|
||||
colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
|
||||
plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
This change will color the points red if the model name contains "gpt-4", otherwise, it will color them blue.
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 0d11357 aider: Color the "gpt-4" models differently.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### color the gpt-3.5 models a new color
|
||||
|
||||
To color the "gpt-3.5" models a new color, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-3.5" in the model name and assign a different color.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
|
||||
=======
|
||||
colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 85d9d84 aider: Color the "gpt-3.5" models green in the plotting code.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### use the colorscheme and fonts from the other graphs in plots.py
|
||||
|
||||
To use the colorscheme and fonts from the graphs in `plots.py`, we need to:
|
||||
|
||||
1. Import the necessary settings from `plots.py`.
|
||||
2. Apply these settings to the graph in `over_time.py`.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
import yaml
|
||||
from datetime import datetime
|
||||
=======
|
||||
import yaml
|
||||
from datetime import datetime
|
||||
from matplotlib import rc
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.figure(figsize=(10, 6))
|
||||
=======
|
||||
plt.rcParams["hatch.linewidth"] = 0.5
|
||||
plt.rcParams["hatch.color"] = "#444444"
|
||||
|
||||
rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
|
||||
|
||||
fig, ax = plt.subplots(figsize=(6, 4))
|
||||
ax.grid(axis="y", zorder=0, lw=0.2)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
=======
|
||||
ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
|
||||
plt.xlabel('Release Date')
|
||||
plt.ylabel('Pass Rate 2')
|
||||
plt.title('Model Performance Over Time')
|
||||
plt.grid(True)
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
=======
|
||||
ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
|
||||
ax.set_xlabel('Release Date')
|
||||
ax.set_ylabel('Pass Rate 2')
|
||||
ax.set_title('Model Performance Over Time')
|
||||
plt.tight_layout()
|
||||
plt.savefig("tmp_over_time.svg")
|
||||
imgcat(fig)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 8d670f1 aider: Used the colorscheme and fonts from the graphs in `plots.py` in `over_time.py`.
|
||||
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### /add tmp_over_time.png
|
||||
> Added /Users/gauthier/Projects/aider/tmp_over_time.png to the chat
|
||||
|
||||
#### make the labels not overlap the dots
|
||||
|
||||
To make the labels not overlap the dots, we can adjust the position of the labels slightly using the `xytext` parameter in the `annotate` function.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
=======
|
||||
ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75,
|
||||
xytext=(5, 5), textcoords='offset points')
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Committing benchmark/over_time.py before applying edits.
|
||||
> Commit 49eb72c Updated image file format from SVG to PNG and added imgcat library for displaying the plot.
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 75fa061 aider: Adjust the position of labels to prevent overlap with dots.
|
||||
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### make the dots bigger
|
||||
|
||||
To make the dots bigger, we can adjust the `s` parameter in the `scatter` function, which controls the size of the dots.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
=======
|
||||
ax.scatter(dates, pass_rates, c=colors, alpha=0.5, s=100)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit a018fdf aider: Make the dots bigger.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### make the border of the graph thinner and lighter
|
||||
|
||||
To make the border of the graph thinner and lighter, we can adjust the `spines` properties of the axes.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
|
||||
=======
|
||||
for spine in ax.spines.values():
|
||||
spine.set_edgecolor('#DDDDDD')
|
||||
spine.set_linewidth(0.5)
|
||||
|
||||
colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 77ba518 aider: Made the border of the graph thinner and lighter by adjusting the `spines` properties of the axes.
|
||||
|
||||
</div>
|
aider/website/_posts/2024-05-22-draft.md (Normal file, 14 lines)
@@ -0,0 +1,14 @@
---
|
||||
title: A draft post.
|
||||
excerpt: With a draft summary.
|
||||
highlight_image: /assets/linting.jpg
|
||||
draft: true
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# A draft post
|
||||
|
||||
Content TBD.
|
aider/website/_posts/2024-05-22-linting.md (Normal file, 149 lines)
@@ -0,0 +1,149 @@
---
|
||||
title: Linting code for LLMs with tree-sitter
|
||||
excerpt: Aider now lints code after every LLM edit and automatically fixes errors, using tree-sitter and AST-aware code context.
|
||||
highlight_image: /assets/linting.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
[](https://aider.chat/assets/linting.jpg)
|
||||
|
||||
# Linting code for LLMs with tree-sitter
|
||||
|
||||
Aider now lints your code after every LLM edit, and offers to automatically fix
|
||||
any linting errors.
|
||||
You can also use aider's lint-and-fix functionality on your source files any time
|
||||
you like, to speedily resolve issues with code written by humans.
|
||||
|
||||
Aider shows linting errors to the LLM in a novel format,
|
||||
using tree-sitter
|
||||
to help display relevant code context for each
|
||||
error.
|
||||
This increases the ability of the LLM to understand the problem and
|
||||
make the correct changes to resolve it.
|
||||
|
||||
Aider ships with basic linters built with tree-sitter that support
|
||||
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
|
||||
These built in linters will detect syntax errors and other fatal problems with the code.
|
||||
|
||||
You can also configure aider to use your preferred linters.
|
||||
This allows aider to check for a larger class of problems, keep the code style
|
||||
aligned with the rest of your team, etc.
|
||||
|
||||
## Linting and fixing your code
|
||||
|
||||
Aider now lints each source file after it applies the edits
|
||||
suggested by an LLM.
|
||||
If problems are found, aider will ask if you'd like it to
|
||||
attempt to fix the errors.
|
||||
If so, aider will send the LLM a report of the lint errors
|
||||
and request changes to fix them. This process may iterate a few times
|
||||
as the LLM works to fully resolve all the issues.
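
Conceptually, the loop looks something like the sketch below. It's a hedged illustration; `run_linters`, `format_report` and the other helpers are hypothetical stand-ins, not aider's real internals:

```python
# Sketch of the lint-and-fix loop; the helper functions are hypothetical.
def lint_and_fix(files, llm, io, max_rounds=3):
    for _ in range(max_rounds):
        errors = run_linters(files)           # hypothetical linter wrapper
        if not errors:
            return True                       # all clean
        if not io.confirm("Attempt to fix lint errors?"):
            return False
        report = format_report(errors)        # LLM-friendly error report
        apply_edits(llm.request_fixes(report))
    return False                              # gave up after max_rounds
```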
|
||||
|
||||
You can also lint and fix files any time, on demand from within the aider chat or via the
|
||||
command line:
|
||||
|
||||
- By default, the in-chat `/lint` command will lint and fix all the files which have
been added to the chat. Or you can name any files
|
||||
in your git repo as arguments.
|
||||
- From the command line, you can run `aider --lint` to lint and fix
|
||||
all the dirty files in the repo.
|
||||
Or you can specify specific filenames on the command line.
|
||||
|
||||
|
||||
## An LLM-friendly lint report
|
||||
|
||||
Most linting tools produce terse and cryptic output,
|
||||
which is one reason many engineers appreciate IDEs that highlight
|
||||
linting errors.
|
||||
LLMs don't have the luxury of using an IDE, so aider sends
the linting errors in an LLM-friendly format.
|
||||
|
||||
Here's an example of raw output of the `flake8` python linter:
|
||||
|
||||
```
|
||||
app.py:23:36: F821 undefined name 'num'
|
||||
app.py:41:16: F541 f-string is missing placeholders
|
||||
```
|
||||
|
||||
This sort of output depends on the user to reference line numbers to find and fix
|
||||
each reported error.
|
||||
LLMs are quite bad at working with source code line numbers, often
|
||||
making off-by-one errors and other mistakes even when provided with
|
||||
a fully numbered code listing.
|
||||
|
||||
Aider augments the raw linter output by
displaying and
highlighting the lines that have errors within their
containing functions, methods and classes.
|
||||
To do this, aider uses tree-sitter to obtain the code's AST and analyzes it
|
||||
in light of the linting errors.
|
||||
LLMs are more effective at editing code that's provided
|
||||
with context like this.
|
||||
|
||||
```
|
||||
app.py:23:36: F821 undefined name 'num'
|
||||
app.py:41:16: F541 f-string is missing placeholders
|
||||
|
||||
app.py:
|
||||
...⋮...
|
||||
6│class LongNum:
|
||||
7│ def __init__(self, num):
|
||||
8│ """
|
||||
9│ Initialize the number.
|
||||
10│ """
|
||||
...⋮...
|
||||
19│ def __str__(self):
|
||||
20│ """
|
||||
21│ Render the number as a string.
|
||||
22│ """
|
||||
23█ return str(num)
|
||||
24│
|
||||
25│
|
||||
26│@app.route('/subtract/<int:x>/<int:y>')
|
||||
...⋮...
|
||||
38│@app.route('/divide/<int:x>/<int:y>')
|
||||
39│def divide(x, y):
|
||||
40│ if y == 0:
|
||||
41█ return f"Error: Cannot divide by zero"
|
||||
42│ else:
|
||||
43│ result = x / y
|
||||
44│ return str(result)
|
||||
45│
|
||||
...⋮...
|
||||
```
|
||||
|
||||
## Basic linters for most popular languages
|
||||
|
||||
Aider comes batteries-included with built in linters for
|
||||
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
|
||||
This provides wide support for linting without requiring
|
||||
users to manually install a linter and configure it to work with aider.
|
||||
|
||||
Aider's built in language-agnostic linter uses tree-sitter to parse
|
||||
the AST of each file.
|
||||
When tree-sitter encounters a syntax error or other fatal issue
|
||||
parsing a source file, it inserts an AST node with type `ERROR`.
|
||||
Aider simply uses these `ERROR` nodes to identify all the lines
|
||||
with syntax or other types of fatal error, and displays
|
||||
them in the LLM friendly format described above.
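
A minimal sketch of that idea, assuming the `tree_sitter_languages` wheels mentioned above (this is illustrative, not aider's actual linter code):

```python
# Sketch: report the lines where tree-sitter inserted ERROR/missing nodes.
from tree_sitter_languages import get_parser

def syntax_error_lines(path, language="python"):
    source = open(path, "rb").read()
    tree = get_parser(language).parse(source)

    bad_lines = set()
    def walk(node):
        if node.type == "ERROR" or node.is_missing:
            bad_lines.add(node.start_point[0] + 1)
        for child in node.children:
            walk(child)

    walk(tree.root_node)
    return sorted(bad_lines)
```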
|
||||
|
||||
## Configuring your preferred linters
|
||||
|
||||
You can optionally configure aider to use
|
||||
your preferred linters with the `--lint-cmd` switch.
|
||||
|
||||
```
|
||||
# To lint javascript with jslint
|
||||
aider --lint-cmd javascript:jslint
|
||||
|
||||
# To lint python with flake8 using some specific args:
|
||||
aider --lint-cmd "python:flake8 --select=E9,F821,F823..."
|
||||
```
|
||||
|
||||
You can provide multiple `--lint-cmd` switches
|
||||
to set linters for various languages.
|
||||
You can also durably set linters in your `.aider.conf.yml` file.
|
||||
|
aider/website/_posts/2024-05-22-swe-bench-lite.md (Normal file, 454 lines)
@@ -0,0 +1,454 @@
---
|
||||
title: How aider scored SOTA 26.3% on SWE Bench Lite
|
||||
excerpt: Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
|
||||
highlight_image: /assets/swe_bench_lite.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# How aider scored SOTA 26.3% on SWE Bench Lite
|
||||
|
||||
[Aider scored 26.3%](https://github.com/swe-bench/experiments/pull/7)
|
||||
on the
|
||||
[SWE Bench Lite benchmark](https://www.swebench.com),
|
||||
achieving a state-of-the-art result.
|
||||
The previous top leaderboard entry was 20.3%
|
||||
from Amazon Q Developer Agent.
|
||||
|
||||
See also [aider's SOTA result on the main SWE Bench](https://aider.chat/2024/06/02/main-swe-bench.html).
|
||||
|
||||
[](https://aider.chat/assets/swe_bench_lite.svg)
|
||||
|
||||
**All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.**
|
||||
All results in the above chart are unhinted pass@1 results.
|
||||
Please see the [references](#references)
|
||||
for details on the data presented in this chart.
|
||||
It was corrected on 5/30/24 to reflect apples-to-apples comparisons,
|
||||
using pass@1 results from AutoCodeRover
|
||||
and results from OpenDevin that don't use hints.
|
||||
The [official SWE Bench Lite leaderboard](https://www.swebench.com)
|
||||
only accepts pass@1 results that do not use hints.
|
||||
|
||||
## Interactive, not agentic
|
||||
|
||||
Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
|
||||
Aider intentionally has quite limited and narrow "agentic behavior"
|
||||
to avoid long delays, high token costs
|
||||
and the need for users to repeatedly code review incorrect solutions.
|
||||
It's also worth noting that aider currently does not use
RAG, vector search or tools, and does not give the LLM access to search the web
or unilaterally execute code.
|
||||
|
||||
Aider is first and foremost an interactive tool for engineers to get real work done in
|
||||
real code bases using a chat interface.
|
||||
Aider provides a pair programming UX where users can ask for a change
|
||||
and see the edits performed in real-time.
|
||||
Aider can also offer additional help like fixing lint or test errors,
|
||||
but the user is always in full interactive control.
|
||||
This lets them quickly steer misunderstandings back on course and
|
||||
avoid wasting time and token costs.
|
||||
|
||||
|
||||
## Benchmark methodology
|
||||
|
||||
For the benchmark,
|
||||
aider was launched in each problem's git repository
|
||||
with the problem statement
|
||||
submitted as the opening chat message from "the user."
|
||||
After that aider runs as normal, with the following modifications:
|
||||
|
||||
- Aider's suggestions were always accepted without user approval.
|
||||
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
|
||||
Plausibly correct means that aider reported that it had successfully edited the repo
|
||||
without causing syntax errors or breaking any *pre-existing* tests.
|
||||
- If the solution isn't plausible, the harness launches aider to try again from scratch,
|
||||
alternating between using aider with GPT-4o and Opus.
|
||||
- If no plausible solution is found after six tries, the harness picks the solution
|
||||
with the fewest edit/lint/test problems.
|
||||
|
||||
It's important to be clear that
|
||||
*aider and the benchmark harness
|
||||
only had access to the pre-existing tests in each problem's repo*.
|
||||
The held out "acceptance tests" were *only* used
|
||||
after benchmarking to compute statistics on which problems aider
|
||||
correctly resolved.
|
||||
|
||||
The [full harness to run aider on SWE Bench Lite is available on GitHub](https://github.com/paul-gauthier/aider-swe-bench).
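
Conceptually, the harness's retry loop looks something like the sketch below; `run_aider` and the result attributes are illustrative stand-ins, not the real harness code:

```python
# Simplified sketch of the retry loop; names here are illustrative.
MODELS = ["gpt-4o", "claude-3-opus-20240229"]

def solve_problem(problem, max_attempts=6):
    attempts = []
    for i in range(max_attempts):
        model = MODELS[i % 2]                  # alternate GPT-4o and Opus
        result = run_aider(problem, model)     # hypothetical helper
        attempts.append(result)
        # Plausible = edits applied cleanly, no lint errors,
        # and no pre-existing tests broken.
        if result.edit_ok and result.lint_ok and result.tests_ok:
            return result
    # No plausible solution: keep the attempt with the fewest problems.
    return min(attempts, key=lambda r: r.num_problems)
```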
|
||||
|
||||
The benchmarking process was similar to how a developer might use aider to
|
||||
resolve a GitHub issue:
|
||||
|
||||
- They could launch aider in their repo with the command below, which
|
||||
tells aider they want to accept every suggestion
|
||||
and to use pytest to run tests.
|
||||
- `aider --yes --test-cmd pytest`
|
||||
- They could start the chat by pasting in the URL or text of a GitHub issue.
|
||||
Aider will pull in the URL's content and then try and solve the issue.
|
||||
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
|
||||
[Aider is tightly integrated with git](https://aider.chat/docs/git.html),
|
||||
so it's always easy to revert AI changes that don't pan out.
|
||||
|
||||
Outside a benchmark setting, it's probably
|
||||
unwise or at least highly inefficient
|
||||
to let *any* AI agent run unsupervised on your code base.
|
||||
The reason aider is intended to be used interactively
|
||||
is so that the user can participate and direct aider's work and approve suggestions.
|
||||
This way the user can offer immediate feedback or corrections if their initial
|
||||
instructions turn out to be ambiguous,
|
||||
or if the AI starts going down a wrong path.
|
||||
|
||||
## Aider with GPT-4o alone was SOTA
|
||||
|
||||
Running the benchmark harness
|
||||
only using aider with GPT-4o to find plausible solutions
|
||||
achieved a score of 25.0%.
|
||||
This was itself matching the state-of-the-art, before being surpassed by the main
|
||||
result being reported here
|
||||
that used aider with both GPT-4o & Opus.
|
||||
|
||||
As noted below, a single attempt using Aider with GPT-4o tied
|
||||
the current top entry on the leaderboard.
|
||||
|
||||
## Aider with GPT-4o & Opus
|
||||
|
||||
The benchmark harness alternated between running aider with GPT-4o and Opus.
|
||||
The harness proceeded in a fixed order, always starting with GPT-4o and
|
||||
then alternating with Opus until a plausible solution was found for each
|
||||
problem.
|
||||
|
||||
The table below breaks down the plausible solutions that
|
||||
were found for the 300 problems.
|
||||
It also provides details on the 79 that were ultimately
|
||||
verified as correctly resolving their issue.
|
||||
Some noteworthy observations:
|
||||
|
||||
- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
|
||||
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark.
|
||||
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
|
||||
- A long tail of solutions continued to be found using both models including one correctly resolved solution on the final, sixth attempt of that problem.
|
||||
|
||||
|
||||
| Attempt | Agent |Number of<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number of<br/>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite |
|
||||
|:--------:|------------|---------:|---------:|----:|---:|--:|
|
||||
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
|
||||
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
|
||||
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
|
||||
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
|
||||
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
|
||||
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
|
||||
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
|
||||
|
||||
|
||||
If we break down the solutions solely by model,
|
||||
we can see that aider with GPT-4o outperforms Opus.
|
||||
This isn't a fair and direct comparison, because GPT-4o always took the first
|
||||
turn and therefore got first crack at all the "easiest" problems.
|
||||
Aider with Opus only ever saw problems that GPT-4o failed to
|
||||
find plausible solutions for on its first try.
|
||||
|
||||
Aider with GPT-4o was producing higher quality plausible solutions,
|
||||
with a greater chance of going on to be accepted as resolving the issue.
|
||||
Again, this is biased by the turn ordering.
|
||||
But other anecdotal evidence from earlier runs of the benchmark
|
||||
also supports the observation that aider with GPT-4o is significantly stronger than Opus
|
||||
for this benchmark.
|
||||
|
||||
|
||||
| Agent | Number of<br>plausible<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>plausible<br>which<br>correctly<br>resolved<br>|
|
||||
|------------|---------:|---------:|---:|
|
||||
| Aider with GPT-4o | 239 | 66 |27.6% |
|
||||
| Aider with Opus | 61 | 13 |21.3% |
|
||||
| **Total** | **300** | **79** |**26.3%** |
|
||||
|
||||
## Repository map, not RAG
|
||||
|
||||
The crucial first step in solving a SWE Bench problem is figuring out
|
||||
which parts of the repo are relevant and which files need to be edited.
|
||||
Most coding agents use some combination of RAG, vector search
|
||||
and providing the LLM with
|
||||
tools to interactively explore the code base.
|
||||
|
||||
Aider instead uses a
|
||||
[repository map](https://aider.chat/2023/10/22/repomap.html)
|
||||
to help the LLM understand the
|
||||
layout, code structure, and content of a git repo.
|
||||
The repo map is created through static analysis of the code's
|
||||
abstract syntax tree and call graph
|
||||
to provide a compact and powerful summary of the entire code base.
|
||||
The map is constantly
|
||||
tailored to show
|
||||
repo context that is relevant to the current state of the chat conversation.
|
||||
This is done by performing a graph optimization on the code's call graph.
|
||||
|
||||
When the user asks for a change to their code, the LLM can use the repo map
|
||||
to decide which files to edit.
|
||||
The LLM simply returns a normal text response explaining which files
|
||||
it needs to edit and why.
|
||||
Aider notices when the LLM mentions filenames from the repo,
|
||||
and asks the user if they should be added to the chat.
|
||||
Adding a file to the chat allows the LLM to see the full contents
|
||||
of the file and edit it.
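
A rough sketch of that mention-detection step (illustrative only, not aider's actual code):

```python
# Sketch: find repo filenames mentioned in an LLM reply.
import re

def mentioned_files(reply_text, repo_files):
    words = set(re.findall(r"[\w./-]+", reply_text))
    by_basename = {f.split("/")[-1]: f for f in repo_files}
    hits = {f for f in repo_files if f in words}
    hits |= {by_basename[w] for w in words if w in by_basename}
    return sorted(hits)
```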
|
||||
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### Please add a new /factorial/N endpoint.
|
||||
|
||||
To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
|
||||
Please add app.py to the chat so I can proceed with the changes.
|
||||
|
||||
> app.py
|
||||
> Add these files to the chat? yes
|
||||
|
||||
</div>
|
||||
|
||||
This is a convenient and natural workflow for interactive chat,
|
||||
and it worked well for the SWE Bench problems.
|
||||
Aider successfully identified the correct file to edit
|
||||
in 70.3% of the benchmark tasks.
|
||||
|
||||
We can determine which file needs to be edited using the "gold" patch
|
||||
which is associated with each SWE Bench task.
|
||||
This patch was created by a human developer
|
||||
to solve the issue, and therefore reveals a file which can
|
||||
be edited to solve the problem.
|
||||
Of course aider is not able to see or use the gold patch
|
||||
or the file names it contains in any way.
|
||||
This information was only used to compute
|
||||
statistics outside the benchmarking process.
|
||||
|
||||
|
||||
## Reliable code editing
|
||||
|
||||
Once files have been selected for editing,
|
||||
the next step is of course to edit the source code to fix the problem.
|
||||
|
||||
Aider goes to great lengths to ensure that LLMs can not just write code,
|
||||
but reliably *edit* code.
|
||||
Aider has a collection of prompting strategies and code editing backends which have
|
||||
been honed through
|
||||
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
|
||||
These foundational capabilities help ensure that aider can
|
||||
properly integrate code from LLMs into an existing code base and source files.
|
||||
|
||||
The repository map helps here too, making sure that the LLM
|
||||
can see relevant classes, functions and variables from the entire repo.
|
||||
This helps ensure that the project's existing APIs and conventions are
|
||||
respected and utilized when new code is added.
|
||||
|
||||
Regardless, there are still cases where aider may be unable to cleanly
|
||||
complete the edits specified by the LLM.
|
||||
This is usually because the LLM has failed to conform to the editing
|
||||
instructions in its system prompt.
|
||||
When aider completes, it returns an editing outcome that indicates
|
||||
whether it was able to successfully apply all edits.
|
||||
The benchmark harness uses this editing status as
|
||||
one criterion to determine if aider has
|
||||
created a plausible solution.
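To make the failure mode concrete, here is a minimal search-and-replace style apply function. It is a simplified sketch, not one of aider's actual edit formats: if the text the LLM asked to search for doesn't appear exactly once in the file, the edit can't be applied cleanly and has to be reported as an editing error.

```
# Simplified sketch of applying one search/replace edit; not aider's edit format.
from pathlib import Path

def apply_edit(path: str, search: str, replace: str) -> bool:
    text = Path(path).read_text()
    if text.count(search) != 1:
        # Ambiguous or missing target: report an edit error instead of guessing.
        return False
    Path(path).write_text(text.replace(search, replace, 1))
    return True
```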
|
||||
|
||||
## Linting and fixing
|
||||
|
||||
Another key criteria for a plausible solution is that it passes basic
|
||||
linting, which means that the code has no syntax
|
||||
or other fatal errors.
|
||||
[Aider lints code](https://aider.chat/2024/05/22/linting.html)
|
||||
after every LLM edit and offers to automatically fix
|
||||
any problems.
|
||||
|
||||
Aider ships with built-in linters based on tree-sitter
|
||||
which work with most popular programming languages.
|
||||
Aider shows linting errors to the LLM in a novel format,
|
||||
using the abstract syntax tree to display relevant code context for each
|
||||
error.
|
||||
This context helps LLMs understand the problem and
|
||||
make the correct changes to resolve it.
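As a rough sketch of the tree-sitter half of this, the snippet below (assuming the `tree_sitter_languages` convenience package) parses a file and reports the positions of any syntax `ERROR` nodes. Aider's built-in linters do considerably more, including running language-specific linters and rendering the AST context shown in the transcript below.

```
# Sketch: locate syntax errors with tree-sitter.
# Assumes the tree_sitter_languages package; aider's linters do much more.
from tree_sitter_languages import get_parser

def syntax_errors(path: str, language: str = "python") -> list[tuple[int, int]]:
    parser = get_parser(language)
    tree = parser.parse(open(path, "rb").read())

    errors, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "ERROR" or node.is_missing:
            # Record the 1-based line and 0-based column of the problem.
            errors.append((node.start_point[0] + 1, node.start_point[1]))
        stack.extend(node.children)
    return errors

print(syntax_errors("app.py"))
```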
|
||||
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
```
|
||||
app.py:23:36: F821 undefined name 'num'
|
||||
|
||||
app.py:
|
||||
...⋮...
|
||||
6│class LongNum:
|
||||
...⋮...
|
||||
19│ def expound(self, threshold):
|
||||
20│ number = self.basis
|
||||
21│ while number < threshold:
|
||||
22│ number *= self.factor
|
||||
23█ return num
|
||||
24│
|
||||
25│
|
||||
...⋮...
|
||||
```
|
||||
|
||||
> Attempt to fix lint errors? yes
|
||||
|
||||
</div>
|
||||
|
||||
In the benchmark, these linting suggestions are always accepted.
|
||||
At completion,
|
||||
aider reports a linting outcome that
|
||||
indicates if it was able to produce
|
||||
code without any outstanding linting errors.
|
||||
The benchmark harness uses this status as
|
||||
one of the criteria to determine if aider has
|
||||
created a plausible solution.
|
||||
|
||||
## Testing and fixing
|
||||
|
||||
The final criterion for a plausible solution is that
|
||||
all tests must be passing.
|
||||
Aider can be configured with the command to run tests for a repo,
|
||||
and will automatically attempt to fix any test failures.
|
||||
|
||||
A user working on a python project might configure testing
|
||||
by launching aider like this:
|
||||
|
||||
```
|
||||
aider --test-cmd pytest
|
||||
```
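Conceptually, the test-fix loop is just: run the configured command, and if it fails, send the output back to the LLM and apply its suggested edits. The sketch below uses a stubbed `ask_llm_to_fix` placeholder rather than aider's real internals.

```
# Sketch of a run-tests-then-fix loop; not aider's actual implementation.
import subprocess

def ask_llm_to_fix(failure_output: str) -> None:
    # Placeholder: in aider, the failing output is sent to the LLM,
    # which replies with edits that are then applied to the repo.
    print("would ask the LLM to fix:\n", failure_output[:500])

def run_tests_and_fix(test_cmd: str, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        result = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # all tests passing
        ask_llm_to_fix(result.stdout + result.stderr)
    return False

print(run_tests_and_fix("pytest"))
```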
|
||||
|
||||
For the benchmark, aider is configured with a test command that will run the
|
||||
tests that already exist in each problem's repository.
|
||||
SWE Bench problems are based on repositories from large open
|
||||
source projects with extensive existing test suites.
|
||||
This means that
|
||||
testing will fail if aider has broken any of these
|
||||
pre-existing tests or if any new
|
||||
tests that it created aren't passing.
|
||||
|
||||
As with editing and linting, aider reports a testing outcome
|
||||
that indicates if it completed with any outstanding failing tests.
|
||||
The benchmark harness uses this status when deciding if aider
|
||||
has produced a plausible solution.
|
||||
|
||||
To be clear, *aider cannot run or even see the held out "acceptance tests"* that
|
||||
are used to judge if a proposed solution correctly
|
||||
resolves the problem.
|
||||
Those tests are only run outside of aider and the benchmark harness,
|
||||
to compute the final benchmark statistics.
|
||||
|
||||
## Finding a plausible solution
|
||||
|
||||
Each time aider executes, it reports
|
||||
the outcome of the editing, linting, and testing
|
||||
steps.
|
||||
Each of these steps may complete successfully or
|
||||
return a status that indicates that there were outstanding
|
||||
problems that remain unresolved.
|
||||
|
||||
The benchmark harness uses these outcomes to determine if
|
||||
aider has produced a plausible
|
||||
solution to the current SWE Bench task.
|
||||
A plausible solution is one where aider
|
||||
returns saying that it
|
||||
edited the repo with no outstanding
|
||||
edit, lint, or test errors.
|
||||
In this case, aider's changes are recorded
|
||||
as the SWE Bench `model_patch` to be evaluated later with the
|
||||
acceptance tests.
|
||||
|
||||
If the solution is not plausible, another
|
||||
instance of aider is launched again from scratch on the same problem.
|
||||
The harness alternates launching aider with GPT-4o and Opus to solve the problem,
|
||||
and gives each model three attempts -- for a total of six attempts.
|
||||
As soon as a plausible solution is found, it is accepted and the
|
||||
harness moves on to the next SWE Bench instance.
|
||||
|
||||
It's worth noting that repositories may have lint or test errors
|
||||
present before aider even starts to edit them.
|
||||
Whether unresolved errors were caused by aider or were pre-existing,
|
||||
there will be instances where
|
||||
no plausible solution is
|
||||
found after six tries.
|
||||
|
||||
If all six attempts fail to produce a plausible solution,
|
||||
then the "best" solution available is selected as the
|
||||
`model_patch`.
|
||||
Which of the non-plausible solutions to use is determined
|
||||
by ignoring the testing outcome
|
||||
and prioritizing solutions in the following order:
|
||||
|
||||
- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.
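Putting the retry logic and this fallback ranking together, the harness behaves roughly like the sketch below. The `run_aider` callable and the outcome dictionary keys are hypothetical stand-ins; the real harness lives in the aider-swe-bench repository.

```
# Rough sketch of the harness loop; names and outcome keys are hypothetical.
MODELS = ["gpt-4o", "opus"] * 3  # alternate models, six attempts total

def solve_instance(problem, run_aider):
    attempts = []
    for model in MODELS:
        outcome = run_aider(problem, model)  # edit/lint/test status plus the diff
        attempts.append(outcome)
        if outcome["edit_ok"] and outcome["lint_ok"] and outcome["test_ok"]:
            return outcome  # plausible solution: accept it and stop

    # No plausible solution after six tries: ignore test results and pick
    # the best attempt using the priority order listed above.
    def priority(o):
        return (
            o["edit_ok"] and o["lint_ok"],       # edited and linted cleanly
            o["edit_partial"] and o["lint_ok"],  # partial edits, lint clean
            o["edit_ok"],                        # edited cleanly
            o["edit_partial"],                   # at least partial edits
        )
    return max(attempts, key=priority)
```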
|
||||
|
||||
## Computing the benchmark score
|
||||
|
||||
The benchmark harness produced a candidate solution for each of the 300
SWE Bench Lite instances and saved it as the `model_patch`.
|
||||
|
||||
A separate evaluation script was used to
|
||||
test each of these solutions with the full test suite,
|
||||
including the held out acceptance tests.
|
||||
For this final acceptance testing, any edits that aider made to tests
|
||||
are discarded.
|
||||
This ensures that the correct,
|
||||
unmodified test suite is used for acceptance testing.
|
||||
The evaluation script compares the test results
|
||||
with results from testing
|
||||
the "gold" patch that was developed by a human to correctly solve the issue.
|
||||
If they match, the candidate solution has correctly resolved the issue.
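The comparison itself boils down to checking that the candidate patch makes the tests pass and fail in the same pattern as the gold patch. Here is a simplified illustration of that check, not the actual SWE Bench evaluation script.

```
# Simplified illustration of acceptance-test comparison; not the real
# SWE Bench evaluation code.
def correctly_resolved(candidate: dict[str, str], gold: dict[str, str]) -> bool:
    """Each dict maps a test id to "PASS" or "FAIL"."""
    return all(candidate.get(test) == status for test, status in gold.items())

gold = {"tests/test_api.py::test_new_behavior": "PASS",
        "tests/test_api.py::test_known_flake": "FAIL"}
candidate = {"tests/test_api.py::test_new_behavior": "PASS",
             "tests/test_api.py::test_known_flake": "FAIL"}
print(correctly_resolved(candidate, gold))  # True
```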
|
||||
|
||||
These acceptance tests are only ever run outside of aider
|
||||
and the benchmark harness, and only to compute the number of
|
||||
correctly resolved instances.
|
||||
They are never run, used, or even visible during aider's attempts to solve the problems.
|
||||
|
||||
Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
Much thanks to the team behind the
|
||||
[SWE Bench](https://www.swebench.com)
|
||||
family of AI coding benchmarks.
|
||||
Also thanks to Albert Örwall who has
|
||||
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker)
|
||||
making it faster, easier, and more reliable to run the acceptance tests.
|
||||
|
||||
|
||||
## References
|
||||
|
||||
All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.
|
||||
|
||||
The "aider agent" internally makes multiple "attempts" at solving the problem,
|
||||
but it picks and returns one single candidate solution.
|
||||
Only that one candidate solution is evaluated with the acceptance tests
|
||||
and contributes to the benchmark score.
|
||||
Thus it is a pass@1 result.
|
||||
|
||||
This is in contrast to a pass@N result for N>1, where N attempts are made
|
||||
and all N solutions are evaluated by the acceptance tests.
|
||||
If *any* of the N solutions pass, that counts as a pass@N success.
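In code form, the distinction looks like this toy calculation, where each boolean records whether an evaluated candidate passed acceptance testing:

```
# Toy illustration of pass@1 vs pass@N scoring.
results = [[True, False], [False, True], [False, False]]  # 3 problems, 2 candidates each

pass_at_1 = sum(r[0] for r in results) / len(results)    # only the chosen candidate counts
pass_at_n = sum(any(r) for r in results) / len(results)  # any passing candidate counts

print(pass_at_1, pass_at_n)  # 0.33... vs 0.66...
```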
|
||||
|
||||
Below are the references for the other pass@1 unhinted SWE-Bench results
|
||||
displayed in the graph at the beginning of this article.
|
||||
|
||||
- [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)
- [19.0% AutoCodeRover](https://www.swebench.com/)
- [18.0% SWE-Agent + GPT-4](https://www.swebench.com)
- [16.7% OpenDevin](https://github.com/OpenDevin/OpenDevin/issues/2149)
- [11.7% SWE-Agent + Opus](https://www.swebench.com)
|
||||
|
||||
Note: the graph was corrected on 5/30/24 as follows.
|
||||
|
||||
The graph now contains AutoCodeRover's average pass@1 results.
|
||||
Previously it displayed pass@3 results, which are
|
||||
not comparable
|
||||
to the pass@1 results for aider being reported here.
|
||||
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
|
||||
features pass@3 results that are not clearly labeled as such.
|
||||
|
||||
The graph now contains the best OpenDevin results obtained without using
|
||||
the SWE Bench `hints_text` to provide hints to the agent.
|
||||
The previous graph contained their hinted result,
|
||||
which is not comparable
|
||||
to the unhinted aider results being reported here.
|
||||
[OpenDevin reported hinted results](https://x.com/gneubig/status/1791498953709752405)
|
||||
without noting that hints were used.
|
70
aider/website/_posts/2024-05-24-self-assembly.md
Normal file
---
|
||||
title: Aider has written 7% of its own code
|
||||
excerpt: Aider has written 7% of its own code, via 600+ commits that inserted 4.8K and deleted 1.5K lines of code.
|
||||
highlight_image: /assets/self-assembly.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Aider has written 7% of its own code
|
||||
|
||||
[](https://aider.chat/assets/self-assembly.jpg)
|
||||
|
||||
The
|
||||
[aider git repo](https://github.com/paul-gauthier/aider)
|
||||
currently contains about 4K commits and 14K lines of code.
|
||||
|
||||
Aider made 15% of the commits, inserting 4.8K and deleting 1.5K lines of code.
|
||||
|
||||
About 7% of the code now in the repo is attributable to an aider commit,
according to `git blame`.
|
||||
This number is probably a significant undercount, because periodic reformatting
|
||||
by `black` is likely obscuring aider's authorship of many lines.
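The per-file numbers below come from `git blame`; a simplified version of that calculation looks like the sketch here, which assumes aider's commits can be recognized by their author string. The repo's actual script (`scripts/blame.py`, listed in the table) is more thorough.

```
# Simplified sketch of counting aider-authored lines with `git blame`.
# Assumes aider's commits are identifiable by their author string.
import subprocess

def blame_stats(path: str, bot_author: str = "aider") -> tuple[int, int]:
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout
    authors = [line[len("author "):] for line in out.splitlines()
               if line.startswith("author ")]
    bot_lines = sum(bot_author.lower() in a.lower() for a in authors)
    return bot_lines, len(authors)

bot, total = blame_stats("aider/commands.py")
print(f"{bot} of {total} lines ({100 * bot / total:.1f}%)")
```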
|
||||
|
||||
Here's the breakdown of the code aider wrote in the current code base
|
||||
according to `git blame`.
|
||||
|
||||
| File | Lines | Percent |
|---|---:|---:|
|aider/args.py| 6 of 449 | 1.3% |
|aider/coders/base_coder.py| 37 of 1354 | 2.7% |
|aider/coders/editblock_coder.py| 14 of 507 | 2.8% |
|aider/coders/editblock_func_coder.py| 6 of 141 | 4.3% |
|aider/coders/udiff_coder.py| 2 of 421 | 0.5% |
|aider/coders/wholefile_coder.py| 5 of 146 | 3.4% |
|aider/coders/wholefile_func_coder.py| 4 of 134 | 3.0% |
|aider/commands.py| 67 of 703 | 9.5% |
|aider/diffs.py| 15 of 129 | 11.6% |
|aider/gui.py| 2 of 533 | 0.4% |
|aider/history.py| 19 of 124 | 15.3% |
|aider/io.py| 55 of 368 | 14.9% |
|aider/linter.py| 30 of 240 | 12.5% |
|aider/main.py| 30 of 466 | 6.4% |
|aider/mdstream.py| 3 of 122 | 2.5% |
|aider/models.py| 22 of 549 | 4.0% |
|aider/repo.py| 19 of 266 | 7.1% |
|aider/repomap.py| 17 of 518 | 3.3% |
|aider/scrape.py| 12 of 199 | 6.0% |
|aider/versioncheck.py| 10 of 37 | 27.0% |
|aider/voice.py| 9 of 104 | 8.7% |
|benchmark/benchmark.py| 33 of 730 | 4.5% |
|benchmark/over_time.py| 32 of 60 | 53.3% |
|benchmark/swe_bench_lite.py| 40 of 71 | 56.3% |
|scripts/blame.py| 55 of 212 | 25.9% |
|scripts/versionbump.py| 96 of 123 | 78.0% |
|setup.py| 11 of 47 | 23.4% |
|tests/test_coder.py| 48 of 612 | 7.8% |
|tests/test_commands.py| 135 of 588 | 23.0% |
|tests/test_editblock.py| 23 of 403 | 5.7% |
|tests/test_io.py| 30 of 65 | 46.2% |
|tests/test_main.py| 13 of 239 | 5.4% |
|tests/test_models.py| 6 of 28 | 21.4% |
|tests/test_repo.py| 2 of 296 | 0.7% |
|tests/test_repomap.py| 70 of 217 | 32.3% |
|tests/test_udiff.py| 7 of 119 | 5.9% |
|tests/test_wholefile.py| 37 of 321 | 11.5% |
| **Total** | **1022 of 14219** | **7.2%** |
|
||||
|
||||
|
267
aider/website/_posts/2024-06-02-main-swe-bench.md
Normal file
---
|
||||
title: Aider is SOTA for both SWE Bench and SWE Bench Lite
|
||||
excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
|
||||
highlight_image: /assets/swe_bench.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Aider is SOTA for both SWE Bench and SWE Bench Lite
|
||||
|
||||
Aider scored 18.9%
|
||||
on the main
|
||||
[SWE Bench benchmark](https://www.swebench.com),
|
||||
achieving a state-of-the-art result.
|
||||
The current top leaderboard entry is 13.8%
|
||||
from Amazon Q Developer Agent.
|
||||
The best result reported elsewhere seems to be
|
||||
[13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
|
||||
|
||||
This result on the main SWE Bench builds on
|
||||
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
|
||||
|
||||
[](https://aider.chat/assets/swe_bench.svg)
|
||||
|
||||
**All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.**
|
||||
Aider was benchmarked on the same
|
||||
[570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
|
||||
that were used in the
|
||||
[Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
|
||||
See the [references](#references)
|
||||
for more details on the data presented in this chart.
|
||||
|
||||
## Interactive, not agentic
|
||||
|
||||
Aider achieved this result mainly through its existing features that focus on static
|
||||
code analysis, reliable LLM code editing, and pragmatic UX for automatically
|
||||
fixing linting and testing errors.
|
||||
Aider intentionally has quite limited and narrow "agentic behavior"
|
||||
to avoid long delays, high token costs
|
||||
and the need for users to repeatedly code review incorrect solutions.
|
||||
It's also worth noting that aider currently does not use
RAG, vector search, or tools, and does not give the LLM access to search the web
or unilaterally execute code.
|
||||
|
||||
Aider is first and foremost an interactive tool for engineers to get real work done in
|
||||
real code bases using a chat interface.
|
||||
Aider provides a pair programming UX where users can ask for a change
|
||||
and see code edits performed in real-time.
|
||||
Aider can also offer additional help like fixing lint or test errors,
|
||||
but the user is always in full interactive control.
|
||||
This allows them to quickly steer misunderstandings back on course and
|
||||
avoid wasting time and token costs.
|
||||
|
||||
|
||||
## Benchmark methodology
|
||||
|
||||
Benchmarking was conducted as follows:
|
||||
|
||||
- Aider with GPT-4o was launched in each problem's git repository
|
||||
with the problem statement
|
||||
submitted as the opening chat message from "the user".
|
||||
- After that aider ran as normal, except all of aider's
|
||||
suggestions were always accepted without user approval.
|
||||
- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
|
||||
Plausibly correct means that aider reported that it had successfully edited the repo
|
||||
without causing syntax errors or breaking any *pre-existing* tests.
|
||||
- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
|
||||
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems.
|
||||
|
||||
It's important to be clear that
|
||||
*aider and the benchmark harness
|
||||
only had access to the pre-existing tests in each problem's repo*.
|
||||
The held out "acceptance tests" were *only* used
|
||||
after benchmarking to compute statistics on which problems aider
|
||||
correctly resolved.
|
||||
|
||||
This is the same approach
|
||||
that was used for
|
||||
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
|
||||
For the Lite benchmark,
|
||||
aider alternated between GPT-4o and Opus for up to six total attempts.
|
||||
To manage the cost of running the main SWE Bench benchmark,
|
||||
aider was limited to two total attempts:
|
||||
one with GPT-4o and one with Opus.
|
||||
|
||||
For a detailed discussion of the benchmark
|
||||
methodology, see the
|
||||
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
|
||||
Also, the
|
||||
[aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
|
||||
contains the harness and statistics code used for the benchmarks.
|
||||
|
||||
The benchmarking process was similar to how a developer might use aider to
|
||||
resolve a GitHub issue:
|
||||
|
||||
- They could launch aider in their repo with the command below, which
|
||||
tells aider they want to accept every suggestion
|
||||
and to use pytest to run tests.
|
||||
- `aider --yes --test-cmd pytest`
|
||||
- They could start the chat by pasting in the URL or text of a GitHub issue.
|
||||
Aider will pull in the URL's content and then try to resolve the issue.
|
||||
- If aider doesn't produce code that lints and tests clean, the user might decide to
|
||||
[use git to revert the changes](https://aider.chat/docs/git.html),
|
||||
and try again with `aider --opus`.
|
||||
|
||||
## Aider with GPT-4o alone was SOTA
|
||||
|
||||
Using aider with GPT-4o to make a single attempt at resolving each problem
|
||||
achieved a score of 17.0%.
|
||||
This was itself a state-of-the-art result, before being surpassed by the main
|
||||
result being reported here
|
||||
that used aider with both GPT-4o & Opus.
|
||||
|
||||
## Aider with GPT-4o & Opus
|
||||
|
||||
The benchmark harness started by using aider with GPT-4o to try
|
||||
and resolve each problem.
|
||||
For problems where this didn't produce a plausible solution,
|
||||
the harness tried again using aider with Opus.
|
||||
So at most, two attempts were made for each problem.
|
||||
|
||||
The table below breaks down the proposed solutions that
|
||||
were found from each attempt at the 570 problems.
|
||||
A proposed solution is either:
|
||||
|
||||
- A plausible solution where
|
||||
aider reported no outstanding errors from editing, linting and testing.
|
||||
- Or, the "most plausible" solution generated by either attempt, with the
|
||||
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
|
||||
|
||||
The table also provides details on the 108 solutions that were ultimately
|
||||
verified as correctly resolving their issue.
|
||||
|
||||
| Attempt | Agent | Number of<br>proposed<br>solutions | Percent of<br>proposed<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
|
||||
|
||||
## Non-plausible but correct solutions?
|
||||
|
||||
A solution doesn't actually have to be plausible in order to correctly resolve the issue.
|
||||
Recall that plausible is simply defined as aider
|
||||
reporting that it successfully completed all file edits,
|
||||
repaired and resolved any linting errors
|
||||
and resolved any test failures.
|
||||
But there are many reasons why aider might fail to do those things
|
||||
and yet still produce a solution that will pass
|
||||
acceptance testing:
|
||||
|
||||
- There may have been pre-existing failing tests in the repo,
|
||||
before aider even started working on the SWE Bench problem.
|
||||
Aider may not have resolved such issues, and yet they may not be
|
||||
relevant to the acceptance testing.
|
||||
The SWE Bench acceptance testing just confirms that tests pass or fail
|
||||
in the same pattern as the "gold patch" developed by a human to resolve the
|
||||
problem.
|
||||
Some tests may fail during acceptance testing,
|
||||
and that's ok as long as they failed for the gold
|
||||
patch too.
|
||||
- There may have been pre-existing linting problems in the repo.
|
||||
If lingering linting issues affected code paths that are not well tested,
|
||||
they may not impact acceptance testing.
|
||||
- Aider may have reported file editing errors because it thought the LLM
|
||||
specified edits that it wasn't able to successfully apply.
|
||||
This can only happen when the LLM specified edits in
|
||||
a way that doesn't comply with the editing instructions in the system prompt.
|
||||
Given that the LLM isn't complying with the system prompt,
|
||||
it may have become confused and
|
||||
asked for redundant or otherwise irrelevant edits.
|
||||
Such outstanding edit errors might not be fatal for acceptance testing.
|
||||
- Etc.
|
||||
|
||||
Keeping all this in mind, we can understand why
|
||||
GPT-4o accounts for 15.3% of the benchmark score in the table above,
|
||||
but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
|
||||
When an Opus attempt is allowed after GPT-4o,
|
||||
it may propose some *incorrect* solutions which
|
||||
are "more plausible" than some of GPT-4o's non-plausible solutions.
|
||||
These more plausible, incorrect solutions can
|
||||
eclipse some of
|
||||
the earlier non-plausible correct solutions that GPT-4o generated.
|
||||
This is why GPT-4o's score in the table
|
||||
showing the combined GPT-4o & Opus results (15.3%)
|
||||
is lower than the result from just one try using aider with GPT-4o (17.0%).
|
||||
|
||||
For these reasons, adding additional attempts is not guaranteed to monotonically
|
||||
increase the number of resolved problems.
|
||||
New solutions may resolve some new problems but they may also
|
||||
eclipse and discard some of the previous non-plausible correct solutions.
|
||||
|
||||
Luckily, the net effect of additional attempts
|
||||
usually increases or at least maintains the
|
||||
number of resolved problems.
|
||||
This was the case for all the attempts made in both this main SWE Bench result and the
|
||||
earlier Lite result.
|
||||
|
||||
## Computing the benchmark score
|
||||
|
||||
The benchmark harness produced one proposed solution for each of
|
||||
the 570 SWE Bench problems.
|
||||
|
||||
A separate evaluation script was used to
|
||||
test each of these solutions with the full test suite,
|
||||
including the held out acceptance tests.
|
||||
For this final acceptance testing, any edits that aider made to tests
|
||||
were discarded.
|
||||
This ensured that the correct,
|
||||
unmodified test suite was used for acceptance testing.
|
||||
The evaluation script compared each proposed solution's test results
|
||||
with results from testing
|
||||
the "gold" patch that was developed by a human to correctly resolve the issue.
|
||||
If they matched, the proposed solution correctly resolved the issue.
|
||||
|
||||
These acceptance tests were only ever run outside of aider
|
||||
and the benchmark harness, and only to compute statistics about the
|
||||
correctly resolved instances.
|
||||
They were never run, used, or even visible during aider's attempts to resolve the problems.
|
||||
|
||||
Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
|
||||
or 18.9%.
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
Much thanks to the team behind the
|
||||
[SWE Bench](https://www.swebench.com)
|
||||
family of AI coding benchmarks.
|
||||
Also thanks to Albert Örwall who has
|
||||
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker)
|
||||
making it faster, easier, and more reliable to run the acceptance tests.
|
||||
|
||||
|
||||
## References
|
||||
|
||||
All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.
|
||||
|
||||
The "aider agent" internally makes multiple "attempts" at solving the problem,
|
||||
but it picks and returns one single candidate solution.
|
||||
Only that one candidate solution is evaluated with the acceptance tests
|
||||
and contributes to the benchmark score.
|
||||
Thus it is a pass@1 result.
|
||||
|
||||
This is in contrast to a pass@N result for N>1, where N attempts are made
|
||||
and all N solutions are evaluated by the acceptance tests.
|
||||
If *any* of the N solutions pass, that counts as a pass@N success.
|
||||
|
||||
Below are the references for the other pass@1 unhinted SWE-Bench results
|
||||
displayed in the graph at the beginning of this article.
|
||||
|
||||
- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
- [13.8% Amazon Q Developer Agent, benchmarked on 2,294 instances.](https://www.swebench.com)
- [12.5% SWE-Agent + GPT-4, benchmarked on 2,294 instances.](https://www.swebench.com)
- [10.6% AutoCodeRover, benchmarked on 2,294 instances.](https://arxiv.org/pdf/2404.05427v2)
- [10.5% SWE-Agent + Opus, benchmarked on 2,294 instances.](https://www.swebench.com)
|
||||
|
||||
The graph contains average pass@1 results for AutoCodeRover.
|
||||
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
|
||||
features their pass@3 results, which are not clearly labeled as such.
|
||||
Table 2 of their
|
||||
[paper](https://arxiv.org/pdf/2404.05427v2)
|
||||
reports an `ACR-avg` result of 10.59% which is an average pass@1 result.
|
||||
|
126
aider/website/_posts/2024-07-01-sonnet-not-lazy.md
Normal file
---
|
||||
title: Sonnet is the opposite of lazy
|
||||
excerpt: Claude 3.5 Sonnet can easily write more good code than fits in one 4k token API response.
|
||||
highlight_image: /assets/sonnet-not-lazy.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||
[](https://aider.chat/assets/sonnet-not-lazy.jpg)
|
||||
|
||||
{% if page.date %}
|
||||
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
|
||||
{% endif %}
|
||||
|
||||
# Sonnet is the opposite of lazy
|
||||
|
||||
Claude 3.5 Sonnet represents a step change
|
||||
in AI coding.
|
||||
It is incredibly industrious, diligent and hard working.
|
||||
Unexpectedly,
|
||||
this presented a challenge:
|
||||
Sonnet
|
||||
was often writing so much code that
|
||||
it was hitting the 4k output token limit,
|
||||
truncating its coding in mid-stream.
|
||||
|
||||
Aider now works
|
||||
around this 4k limit and allows Sonnet to produce
|
||||
as much code as it wants.
|
||||
The result is surprisingly powerful.
|
||||
Sonnet's score on
|
||||
[aider's refactoring benchmark](https://aider.chat/docs/leaderboards/#code-refactoring-leaderboard)
|
||||
jumped from 55.1% up to 64.0%.
|
||||
This moved Sonnet into second place, ahead of GPT-4o and
|
||||
behind only Opus.
|
||||
|
||||
Users who tested Sonnet with a preview of
|
||||
[aider's latest release](https://aider.chat/HISTORY.html#aider-v0410)
|
||||
were thrilled:
|
||||
|
||||
- *Works like a charm. It is a monster. It refactors files of any size like it is nothing. The continue trick with Sonnet is truly the holy grail. Aider beats [other tools] hands down. I'm going to cancel both subscriptions.* -- [Emasoft](https://github.com/paul-gauthier/aider/issues/705#issuecomment-2200338971)
|
||||
- *Thanks heaps for this feature - it's a real game changer. I can be more ambitious when asking Claude for larger features.* -- [cngarrison](https://github.com/paul-gauthier/aider/issues/705#issuecomment-2196026656)
|
||||
- *Fantastic...! It's such an improvement not being constrained by output token length issues. [I refactored] a single JavaScript file into seven smaller files using a single Aider request.* -- [John Galt](https://discord.com/channels/1131200896827654144/1253492379336441907/1256250487934554143)
|
||||
|
||||
## Hitting the 4k token output limit
|
||||
|
||||
All LLMs have various token limits, the most familiar being their
|
||||
context window size.
|
||||
But they also have a limit on how many tokens they can output
|
||||
in response to a single request.
|
||||
Sonnet and the majority of other
|
||||
models are limited to returning 4k tokens.
|
||||
|
||||
Sonnet's amazing work ethic caused it to
|
||||
regularly hit this 4k output token
|
||||
limit for a few reasons:
|
||||
|
||||
1. Sonnet is capable of outputting a very large amount of correct,
complete new code in one response.
2. Similarly, Sonnet can specify long sequences of edits in one go,
like changing a majority of lines while refactoring a large file.
3. Sonnet tends to quote large chunks of a
file when performing SEARCH/REPLACE edits.
Beyond token limits, this is very wasteful.
|
||||
|
||||
## Good problems
|
||||
|
||||
Problems (1) and (2) are "good problems"
|
||||
in the sense that Sonnet is
|
||||
able to write more high quality code than any other model!
|
||||
We just don't want it to be interrupted prematurely
|
||||
by the 4k output limit.
|
||||
|
||||
Aider now allows Sonnet to return code in multiple 4k token
|
||||
responses.
|
||||
Aider seamlessly combines them so that Sonnet can return arbitrarily
|
||||
long responses.
|
||||
This gets all the upsides of Sonnet's prolific coding skills,
|
||||
without being constrained by the 4k output token limit.
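Conceptually, the workaround is a continuation loop over the Anthropic Messages API: when a response stops with `stop_reason == "max_tokens"`, resend the conversation with the partial reply prefilled as the last assistant message so the model picks up where it stopped. This is a rough sketch under those assumptions, not aider's actual implementation; the model id shown is simply the Sonnet release current when this was written.

```
# Rough sketch of continuing past the 4k output limit; not aider's real code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def long_completion(base_messages, model="claude-3-5-sonnet-20240620"):
    partial = ""
    while True:
        messages = list(base_messages)
        if partial:
            # Prefill the partial reply so the model continues where it stopped.
            # Note: the API rejects prefills ending in whitespace; real code
            # must handle that edge case.
            messages.append({"role": "assistant", "content": partial})
        response = client.messages.create(model=model, max_tokens=4096,
                                          messages=messages)
        partial += "".join(block.text for block in response.content)
        if response.stop_reason != "max_tokens":
            return partial
```

The same idea works with streaming; the essential part is checking `stop_reason` and resending with the accumulated partial reply.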
|
||||
|
||||
|
||||
## Wasting tokens
|
||||
|
||||
Problem (3) is more complicated, as Sonnet isn't just
|
||||
being stopped early -- it's actually wasting a lot
|
||||
of tokens, time and money.
|
||||
|
||||
Faced with a few small changes spread far apart in
|
||||
a source file,
|
||||
Sonnet would often prefer to do one giant SEARCH/REPLACE
|
||||
operation of almost the entire file.
|
||||
It would be far faster and less expensive to instead
|
||||
do a few surgical edits.
|
||||
|
||||
Aider's prompts now discourage these long-winded
SEARCH/REPLACE operations
and encourage much more concise edits.
|
||||
|
||||
|
||||
## Aider with Sonnet
|
||||
|
||||
[The latest release of aider](https://aider.chat/HISTORY.html#aider-v0410)
|
||||
has specialized support for Claude 3.5 Sonnet:
|
||||
|
||||
- Aider allows Sonnet to produce as much code as it wants,
|
||||
by automatically and seamlessly spreading the response
|
||||
out over a sequence of 4k token API responses.
|
||||
- Aider carefully prompts Sonnet to be concise when proposing
|
||||
code edits.
|
||||
This reduces Sonnet's tendency to waste time, tokens and money
|
||||
returning large chunks of unchanging code.
|
||||
- Aider now uses Claude 3.5 Sonnet by default if the `ANTHROPIC_API_KEY` is set in the environment.
|
||||
|
||||
See
|
||||
[aider's install instructions](https://aider.chat/docs/install.html)
|
||||
for more details, but
|
||||
you can get started quickly with aider and Sonnet like this:
|
||||
|
||||
```
|
||||
$ pip install aider-chat
|
||||
|
||||
$ export ANTHROPIC_API_KEY=<key> # Mac/Linux
|
||||
$ setx ANTHROPIC_API_KEY <key> # Windows
|
||||
|
||||
$ aider
|
||||
```
|
||||
|