---
title: Aider scores 26.3% on SWE Bench Lite
excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state-of-the-art result.
highlight_image: /assets/swe_bench_lite.jpg
draft: true
---
# Aider scores 26.3% on SWE Bench Lite
[Aider scored 26.3%](https://github.com/swe-bench/experiments/pull/7)
on the
[SWE Bench Lite benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
![SWE Bench Lite results](https://aider.chat/assets/swe_bench_lite.svg)
## Interactive, not agentic
Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing,
and pragmatic features for AI pair programming.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.
It's also worth noting that aider currently does not use
RAG, vector search, or tools, and does not give the LLM access to search the web
or unilaterally execute code.
Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming experience where users can ask for a change
and see the edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasting time and token costs.
## Benchmark methodology
For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that, aider ran as normal, with the following modifications:
- Aider's suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
  Plausibly correct means that aider reported that it had successfully edited the repo
  without causing syntax errors or breaking any *pre-existing* tests.
- If the solution wasn't plausible, the harness launched aider to try again from scratch,
  alternating between using aider with GPT-4o and Opus.
- If no plausible solution was found after six tries, the harness picked the solution
  with the fewest edit/lint/test problems.
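
The retry logic described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark harness: the `Attempt` fields, the `run_aider` callable, and the choice to rank fallback candidates by the simple sum of their edit, lint, and test problems are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

MODELS = ["gpt-4o", "opus"]  # labels only; GPT-4o goes first, then the models alternate
MAX_ATTEMPTS = 6


@dataclass
class Attempt:
    """Hypothetical record of one aider run on one problem."""
    model: str
    edit_errors: int    # edits aider reported it could not apply
    lint_errors: int    # syntax/lint problems left in the repo
    test_failures: int  # pre-existing tests broken by the change

    @property
    def problems(self) -> int:
        # Assumption: all problem types are weighted equally.
        return self.edit_errors + self.lint_errors + self.test_failures

    @property
    def plausible(self) -> bool:
        return self.problems == 0


def solve(run_aider: Callable[[str], Attempt]) -> Attempt:
    """Retry from scratch, alternating models, until an attempt is plausible."""
    attempts: List[Attempt] = []
    for i in range(MAX_ATTEMPTS):
        attempt = run_aider(MODELS[i % len(MODELS)])
        if attempt.plausible:
            return attempt
        attempts.append(attempt)
    # No plausible solution: fall back to the attempt with the fewest problems.
    return min(attempts, key=lambda a: a.problems)
```

Here `run_aider` stands in for whatever launches aider in a fresh checkout of the problem's repo and collects the edit, lint, and test outcomes.
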
It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.
The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:
- They could launch aider in their repo with the command below, which
  tells aider they want to accept every suggestion
  and to use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
  Aider will pull in the URL's content and then try to solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
  [Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
  so it's always easy to revert AI changes that don't pan out.
Outside a benchmark setting, it's probably
unwise or at least highly inefficient
to let *any* AI agent run unsupervised on your code base.
The reason aider is intended to be used interactively
is so that the user can participate and direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.
## Aider with GPT-4o alone was SOTA
Running the benchmark harness
only using aider with GPT-4o to find plausible solutions
achieved a score of 25.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.
As noted below, a single attempt using Aider with GPT-4o tied
the current top entry on the leaderboard.
## Aider with GPT-4o & Opus
The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found for each
problem.
The table below breaks down the plausible solutions that
were found for the 300 problems.
It also provides details on the 79 that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:
- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, Aider with GPT-4o and Opus resolved 71 of the 300 problems (23.7%), better than all other known results.
These first two attempts obtained ~86% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found using both models, including one correctly resolved solution on the final, sixth attempt of that problem.

| Attempt | Agent | Number of plausible solutions | Percent of plausible solutions | Number of correctly resolved solutions | Percent of correctly resolved solutions | Score on SWE Bench Lite (resolved/300) |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
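
The running totals behind these observations can be recomputed directly from the per-attempt counts in the table. The snippet below is only an arithmetic check; the variable and function names are made up for the example.

```python
# (attempt, plausible, resolved) copied from the table above
ROWS = [(1, 208, 61), (2, 49, 10), (3, 20, 3), (4, 9, 2), (5, 11, 2), (6, 3, 1)]
TOTAL_PROBLEMS = 300


def cumulative(rows, total=TOTAL_PROBLEMS):
    """Return (attempt, plausible so far, resolved so far, score %) per attempt."""
    out = []
    plausible = resolved = 0
    for attempt, p, r in rows:
        plausible += p
        resolved += r
        out.append((attempt, plausible, resolved, round(100 * resolved / total, 1)))
    return out


for attempt, p, r, score in cumulative(ROWS):
    print(f"after attempt {attempt}: {p} plausible, {r} resolved, score {score}%")
```
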
If we break down the solutions solely by model,
we can see that aider with GPT-4o outperforms aider with Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw problems that GPT-4o failed to
find plausible solutions for on its first try.
Aider with GPT-4o was producing higher quality plausible solutions,
with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.

| Agent | Number of plausible solutions | Number of correctly resolved solutions | Percent of plausible which correctly resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 239 | 66 |27.6% |
| Aider with Opus | 61 | 13 |21.3% |
| **Total** | **300** | **79** |**26.3%** |
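
The per-model figures follow from grouping the per-attempt counts by model. As a quick check, with illustrative names that are not part of aider or the harness:

```python
from collections import defaultdict

# (model, plausible, resolved) for attempts 1 through 6, from the per-attempt table
ATTEMPTS = [
    ("GPT-4o", 208, 61),
    ("Opus", 49, 10),
    ("GPT-4o", 20, 3),
    ("Opus", 9, 2),
    ("GPT-4o", 11, 2),
    ("Opus", 3, 1),
]


def by_model(attempts):
    """Sum plausible/resolved per model and compute the resolved-per-plausible rate."""
    totals = defaultdict(lambda: [0, 0])
    for model, plausible, resolved in attempts:
        totals[model][0] += plausible
        totals[model][1] += resolved
    return {m: (p, r, round(100 * r / p, 1)) for m, (p, r) in totals.items()}


for model, (p, r, pct) in by_model(ATTEMPTS).items():
    print(f"{model}: {p} plausible, {r} resolved, {pct}% of plausible resolved")
```
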
## Repository map, not RAG
The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and providing the LLM with
tools to interactively explore the code base.
Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created through static analysis of the code's
abstract syntax tree and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation.
This is done by performing a graph optimization on the code's call graph.
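
To give a feel for what ranking a code graph by relevance to the chat can look like, here is a toy sketch. It uses a personalized PageRank over file-to-file references as a stand-in; that specific algorithm, the `rank_files` function, and the example file names are assumptions for illustration, not a description of aider's implementation.

```python
def rank_files(edges, chat_files, damping=0.85, iterations=50):
    """Rank files by relevance to the files currently in the chat.

    edges: list of (referencing_file, referenced_file) pairs.
    chat_files: set of files the conversation is focused on.
    """
    nodes = sorted({n for edge in edges for n in edge} | set(chat_files))
    # Random restarts land only on the chat files, which biases the ranking
    # toward code reachable from the current conversation.
    focus = [n for n in nodes if n in chat_files] or nodes
    restart = {n: (1 / len(focus) if n in focus else 0.0) for n in nodes}

    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    nxt[dst] += share
            else:
                # Dangling node: hand its weight back to the restart set.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return sorted(nodes, key=lambda n: rank[n], reverse=True)


edges = [
    ("views.py", "models.py"),
    ("views.py", "utils.py"),
    ("admin.py", "models.py"),
    ("cli.py", "utils.py"),
]
print(rank_files(edges, chat_files={"views.py"}))
```

With `views.py` in the chat, the files it references rank above files that nothing in the conversation points to, which is the kind of tailoring a context-aware map needs.
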
When the user asks for a change to their code, the LLM can use the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
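
A simplified sketch of spotting filenames in an LLM reply is shown below. The function name, the punctuation set, and the exact-path matching rule are assumptions chosen for the example; aider's real matching may differ.

```python
from typing import List, Set


def mentioned_files(response: str, repo_files: List[str], in_chat: Set[str]) -> List[str]:
    """Return repo files named in the response that are not yet in the chat."""
    # Split on whitespace and strip surrounding quotes and punctuation, so that
    # "`utils/text.py`," and "utils/text.py." both match "utils/text.py".
    words = {w.strip("`'\"()[]{}<>,.:;!?*") for w in response.split()}
    return [path for path in repo_files if path not in in_chat and path in words]


reply = "I need to edit `django/db/models.py` and maybe utils/text.py. README.md is fine."
repo = ["django/db/models.py", "utils/text.py", "README.md", "setup.py"]
print(mentioned_files(reply, repo, in_chat={"README.md"}))
```

Each returned path would then be offered to the user, who decides whether to add it to the chat.
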