---
title: Aider scores SOTA 26.3% on SWE Bench Lite
excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
draft: true
---

# Aider scores SOTA 26.3% on SWE Bench Lite

[Aider scored 26.3%]()
on the
[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
The current top leaderboard entry is 20.33%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).

Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing,
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior":
it doesn't require a highly detailed upfront "spec" from the user,
use RAG or vector search, farm out sub-problems to an army of LLMs,
allow the LLM to use tools
or perform web searches,
etc.

Aider is first and foremost a tool for engineers to get real work done in
real code bases through a pair programming chat style interface.
In normal use, the user is in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasted time, code reviews and token costs.
When a user asks aider for a change, they see the edits performed in real-time.
Aider may also then offer additional
help like fixing lint or test errors.

For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
After that, aider ran as normal, with the following modifications:

- Aider's suggestions were always accepted.
When chatting, aider will suggest which files in the repo may need to be edited based on
the conversation.
It will offer to lint code that has been edited,
and to fix any issues uncovered.
Aider has workflows to run the repo's test suite and resolve failing tests.
Normally the user is asked to approve such suggestions, but
they were always accepted during the benchmark.
- A simple harness was used to retry the SWE Bench problem if aider produced code which wasn't *plausibly correct*.
Plausible means that aider successfully edited the repo without breaking anything.
As mentioned, aider has integrated support for linting and testing,
so the harness just looks at aider's completion status to see if those
operations finished cleanly.
Note that *aider only had access to the pre-existing tests in the repo*,
not the held-out "acceptance tests" that are used later to see if the
SWE Bench problem was correctly resolved.
- If the solution isn't plausible, the harness launches aider to try again from scratch.
The harness alternates between running aider with GPT-4o and Opus, up to three times each,
until it finds a plausible solution.
- If no plausible solution is found, the harness picks the solution
with the fewest edit/lint/test problems.

This is all roughly equivalent to a user:

- Launching aider in their repo with a command something like the one below, which
tells aider to say yes to every suggestion and use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like:
  - `/web https://github.com/django/django/issues/XXX`
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again. Maybe with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to undo/revert AI changes that don't pan out.

Of course, outside a benchmark setting it's probably
unwise to let *any* AI agent run unsupervised on your code base.
Aider is intended to be used as an interactive pair-programming chat,
where the user participates to direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.

## Aider with GPT-4o alone was SOTA

Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
achieved a score of 25%.
This was itself a state of the art result, before being surpassed by the main
result reported here,
which uses aider with both GPT-4o & Opus.

## GPT-4o vs Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found.

The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their task.
Some noteworthy observations:

- Aider with GPT-4o immediately found 77% of the valid solutions on the first attempt.
- ~90% of valid solutions were found after one attempt each from aider with GPT-4o and Opus.
- A long tail of solutions continued to be found by both models, including on the final, 6th attempt.

| Attempt | Model | Number<br/>resolved | Percent<br/>of resolved | Cumulative<br/>percent of<br/>resolved |
|:--------:|------------|---------:|---------:|----:|
| 1 | GPT-4o | 61 | 77.2 | 77.2 |
| 2 | Opus | 10 | 12.7 | 89.9 |
| 3 | GPT-4o | 3 | 3.8 | 93.7 |
| 4 | Opus | 2 | 2.5 | 96.2 |
| 5 | GPT-4o | 2 | 2.5 | 98.7 |
| 6 | Opus | 1 | 1.3 | 100.0 |
|**Total**| | **79** | **100%** | **100%** |
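
For readers who want to check the derived columns, the short sketch below recomputes the percent and cumulative-percent figures from the per-attempt resolution counts shown in the table.

```
# Recompute the table's derived columns from the per-attempt counts.
attempts = [("GPT-4o", 61), ("Opus", 10), ("GPT-4o", 3),
            ("Opus", 2), ("GPT-4o", 2), ("Opus", 1)]

total = sum(count for _, count in attempts)  # 79 resolved instances
cumulative = 0
for i, (model, count) in enumerate(attempts, start=1):
    cumulative += count
    print(f"Attempt {i} ({model}): {count:2d} resolved, "
          f"{100 * count / total:4.1f}% of resolved, "
          f"{100 * cumulative / total:5.1f}% cumulative")
```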

If we just look at which models produced correct solutions,
we can see that GPT-4o dominates.
This isn't a fair comparison, because GPT-4o always took the first
attempt at solving each problem.
But anecdotal evidence from early runs of the benchmark
supports the observation that GPT-4o is significantly stronger than Opus
for this endeavor.

| Model | Number resolved | Percent of resolved |
|------------|---------:|---------:|
| GPT-4o | 66 | 83.5 |
| Opus | 13 | 16.5 |
|**Total**| **79** | **100%** |

## Repository map, not RAG

The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and arming the LLM with
tools to interactively explore the code base.

Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure and content of a git repo.
The repo map is created from the code's AST and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation,
by performing a graph optimization on the code's call graph.
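
To make the idea concrete, here is a heavily simplified sketch of what a repository map can look like, using Python's built-in `ast` module to list each file's top-level classes and functions. Aider's real map is built with tree-sitter, covers many languages, and ranks entries with a graph optimization over the call graph; none of that is attempted here.

```
# Heavily simplified repo-map sketch: list each Python file's top-level
# classes and functions. Aider's real map uses tree-sitter, spans many
# languages, and ranks symbols with a graph optimization; this does not.
import ast
from pathlib import Path

def simple_repo_map(repo_root: str) -> str:
    lines = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse
        symbols = []
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                symbols.append(f"    def {node.name}()")
            elif isinstance(node, ast.ClassDef):
                symbols.append(f"    class {node.name}")
        if symbols:
            lines.append(f"{path.relative_to(repo_root)}:")
            lines.extend(symbols)
    return "\n".join(lines)

# Print a map of the current directory.
print(simple_repo_map("."))
```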

When the user asks for a change to their code, the LLM uses the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
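
As a rough illustration of how that hand-off can work, the sketch below scans an LLM reply for tokens that match known repo file paths. The `mentioned_files()` helper and its matching rules are illustrative assumptions, not aider's actual logic.

```
# Rough sketch: spot repo filenames mentioned in an LLM reply, so they can
# be offered for addition to the chat. Aider's real matching is more robust.
import re

def mentioned_files(llm_reply: str, repo_files: set[str]) -> list[str]:
    # Index by basename too, so "app.py" matches "src/app.py".
    by_basename = {path.rsplit("/", 1)[-1]: path for path in repo_files}
    found = set()
    for raw in re.findall(r"[\w./-]+", llm_reply):
        token = raw.strip(".,")  # drop trailing sentence punctuation
        if token in repo_files:
            found.add(token)
        elif token in by_basename:
            found.add(by_basename[token])
    return sorted(found)

print(mentioned_files(
    "The most likely file that needs to be edited is app.py.",
    {"src/app.py", "src/utils.py", "README.md"},
))  # -> ['src/app.py']
```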

<div class="chat-transcript" markdown="1">

#### Please add a new /factorial/N endpoint.

To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
Please add app.py to the chat so I can proceed with the changes.

> app.py
> Add these files to the chat? yes

</div>

This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench tasks.
Each task comes with a "gold" patch, which was created by a human developer
to solve the issue.
Aider successfully identified and added the file from the gold patch
in 70.3% of the benchmark tasks.

Of course aider is not able to see or use the gold patch
or the files it names in any way.
They were only used to compute this statistic after the benchmarking was completed.

## Reliable code editing

Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.

Aider has always had a deep focus on ensuring that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that the LLM can not only code up a solution, but
also properly integrate it into the existing code base and source files.

The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
This helps ensure that the project's existing APIs and conventions are
respected when new code is added.

## Linting and fixing

[Aider lints code]()
after every LLM edit, and offers to automatically fix
any linting errors.
Aider includes basic linters built with tree-sitter that support
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
These built-in linters will detect syntax errors and other fatal problems with the code.

Users can also configure aider to use their preferred linters.
This allows aider to check for a larger class of problems, keep the code style
aligned with the rest of the repo, etc.
But for the benchmark, aider simply used its built-in linters.
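
As a stand-in for what "syntax errors and other fatal problems" means in practice, here is a minimal, Python-only sketch that tries to compile a file and reports any syntax error in a lint-style format. Aider's built-in linters are tree-sitter based and multi-language; the `basic_lint()` helper and the `app.py` example are just illustrative.

```
# Minimal, Python-only stand-in for a "fatal problems" lint pass.
# Aider's built-in linters use tree-sitter and cover many languages.
def basic_lint(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        source = f.read()
    try:
        compile(source, path, "exec")  # parses and compiles, never executes
    except SyntaxError as err:
        return [f"{path}:{err.lineno}:{err.offset}: {err.msg}"]
    return []

# Any errors found would be shown to the LLM, which is asked to fix them.
for problem in basic_lint("app.py"):
    print(problem)
```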

Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree (AST) to display relevant code context for each
error.
This context increases the ability of the LLM to understand the problem and
make the correct changes to resolve it.

<div class="chat-transcript" markdown="1">

> app.py:23:36: F821 undefined name 'num'
> app.py:41:16: F541 f-string is missing placeholders
>
> app.py:
> ...⋮...
> 6│class LongNum:
> 7│    def __init__(self, num):
> 8│        """
> 9│        Initialize the number.
> 10│        """
> ...⋮...
> 19│    def __str__(self):
> 20│        """
> 21│        Render the number as a string.
> 22│        """
> 23█        return str(num)
> 24│
> 25│
> 26│@app.route('/subtract/<int:x>/<int:y>')
> ...⋮...
> 38│@app.route('/divide/<int:x>/<int:y>')
> 39│def divide(x, y):
> 40│    if y == 0:
> 41█        return f"Error: Cannot divide by zero"
> 42│    else:
> 43│        result = x / y
> 44│        return str(result)
> 45│
> ...⋮...
>
> Attempt to fix lint errors? yes

</div>

## Testing and fixing

Aider can be configured with the command needed to run tests for a repo.
A user working on a Python project might do that by launching
aider like this:

```
aider --test-cmd pytest
```

The repositories that are used in the SWE Bench problems are large open
source projects with extensive existing test suites.
A repo's test suite can be run in three ways:

1. Run tests as they existed before trying to solve the problem, without any changes.
2. Run tests after aider has modified the repo.
So the pre-existing test cases are still present, but may have been modified by aider.
Aider may have also added new tests.
3. Run the final "acceptance tests" to judge if the coding agent has
successfully resolved the problem.
SWE Bench verifies both pre-existing tests and a set of held-out acceptance tests
(from the so-called `test_patch`)
to check that the issue is properly resolved. During this final acceptance testing,
any aider edits to tests are discarded to ensure a faithful test of whether the
issue was resolved.

For the benchmark, aider is configured with a test command that will run the tests
as described in (2) above.
So testing will fail if aider has broken any pre-existing tests or if any new
tests that it created aren't passing.
When aider runs a test command, it checks for a non-zero exit status.
If the tests fail, aider will automatically
share the test output with the LLM and ask it to
try to resolve the test failures.
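
A minimal sketch of that step, assuming the test command is `pytest` and simply printing the message that would be handed back to the LLM; this is an illustration, not aider's internal implementation.

```
# Minimal sketch of the test step: run the configured test command, and if it
# exits non-zero, capture the output so it can be shared with the LLM.
import subprocess

def run_tests(test_cmd: str = "pytest") -> str | None:
    """Return failure output if the test command fails, else None."""
    result = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
    if result.returncode == 0:
        return None  # tests passed, nothing to fix
    # Keep only the tail so the message shared with the LLM stays small.
    return (result.stdout + result.stderr)[-4000:]

failure_output = run_tests("pytest")
if failure_output is not None:
    # In aider, this output is shared with the LLM, which is asked to fix the failures.
    print("The test command failed with this output:\n" + failure_output)
```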

To be clear, *aider cannot run or even see the "acceptance tests"* from the `test_patch`
as described in (3).
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.

## Finding a plausible solution

As aider executes, it notes the outcome of the editing, linting and testing
steps.
When aider completes, it reports its final status as either:
succeeded with no errors remaining,
or ended without resolving all errors.

The benchmark harness uses these outcomes to determine if it has a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint or test errors.
In this case, aider's changes are taken as the proposed solution and recorded
as the SWE Bench `model_patch` to be evaluated later with the
`test_patch` "acceptance tests".

If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
The harness alternates asking GPT-4o and Opus to solve the problem,
and gives each model three attempts -- for a total of six attempts.
As soon as a plausible solution is found, it is accepted and the
harness moves on to the next SWE Bench instance.

It's worth noting that repositories may have lint or test errors present before aider even starts to edit them. Whether errors are caused by aider or were pre-existing, there will be instances where, after six tries, no plausible solution is obtained.

If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
`model_patch`.
Which of the non-plausible solutions to use is determined
by ignoring the testing outcome
and prioritizing solutions in the following order
(a rough sketch of this retry-and-select logic follows the list):

- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.
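
The sketch below pulls the whole harness together: alternate GPT-4o and Opus for up to six attempts, accept the first plausible solution, and otherwise fall back to the "best" non-plausible attempt. The `run_aider()` callable, the model name strings and the fields on `Attempt` are illustrative assumptions about what the harness records, not aider's actual API.

```
# Rough sketch of the benchmark harness's retry-and-select logic.
from dataclasses import dataclass

@dataclass
class Attempt:
    diff: str            # the repo changes, to be saved as the model_patch
    edit_ok: bool        # edits applied cleanly
    edit_partial: bool   # at least some edits applied
    lint_ok: bool        # linting finished with no errors
    tests_ok: bool       # test suite finished with no failures

def plausible(a: Attempt) -> bool:
    return a.edit_ok and a.lint_ok and a.tests_ok

def solve(problem, run_aider) -> str:
    attempts = []
    for model in ["gpt-4o", "claude-3-opus"] * 3:   # six attempts, alternating
        attempt = run_aider(model, problem)
        if plausible(attempt):
            return attempt.diff                     # accept and move on
        attempts.append(attempt)

    # No plausible solution: ignore test outcomes and pick by priority.
    for keep in (
        lambda a: a.edit_ok and a.lint_ok,
        lambda a: a.edit_partial and a.lint_ok,
        lambda a: a.edit_ok,
        lambda a: a.edit_partial,
    ):
        for a in attempts:
            if keep(a):
                return a.diff
    return attempts[-1].diff                        # last resort
```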

## Computing the benchmark score

The benchmark harness produces one "best" solution for each of the 300
SWE Bench Lite instances, and saves it as a `model_patch`.
A separate evaluation script uses the SWE Bench support code to
test each of these results with the acceptance tests.

These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used or even visible during the attempts to solve the problems.

Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.

## Acknowledgments

Many thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall, who has
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker),
making it faster, easier and more reliable to run the acceptance tests.