mirror of https://github.com/Aider-AI/aider.git, synced 2025-06-06 04:35:00 +00:00
copy
parent 9f2554fed7
commit b67891e7f8
1 changed file with 35 additions and 42 deletions
@@ -10,7 +10,7 @@ draft: true
 Aider scored 26.3%
 on the
 [SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
-The current top leaderboard entry is 20.33%
+The current top leaderboard entry is 20.3%
 from Amazon Q Developer Agent.
 The best result reported elsewhere online seems to be
 [22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
@@ -31,12 +31,13 @@ etc.
 
 Aider is first and foremost a tool for engineers to get real work done in
 real code bases through a pair programming chat style interface.
+When a user asks aider for a change, they see the edits performed in real-time
+and aider may also then offer additional
+help like fixing lint or test errors.
 In normal use, the user is in full interactive control.
 This lets them quickly steer misunderstandings back on course and
 avoid wasted time, code reviews and token costs.
-When a user asks aider for a change, they see the edits performed in real-time.
-Aider may also then offer additional
-help like fixing lint or test errors.
 
 ## Methodology
 
@@ -46,29 +47,21 @@ with the problem statement
 submitted as the opening chat message from "the user".
 After that aider runs as normal, with the following modifications:
 
-- Aider's suggestions were always accepted.
-When chatting, aider will suggest which files in the repo may need to be edited based on
-the conversation.
-It will offer to lint code that has been edited,
-and to fix any issues uncovered.
-Aider has workflows to run the repo's test suite and resolve failing tests.
-Normally the user is asked to approved such suggestions, but
-they were always accepted during the benchmark.
+- Aider's suggestions were always accepted without user approval.
 - A simple harness was used to retry the SWE Bench problem if aider produced code which wasn't *plausibly correct*.
-Plausible means that aider successfully edited the repo without breaking anything.
-As mentioned, aider has integrated support for linting and testing,
-so the harness just looks at aider's completion status to see if those
-operations finished clean.
-Note that *aider only had access to the pre-existing tests in the repo*,
-not the held out "acceptance tests" that are used later to see if the
-SWE Bench problem was correctly resolved.
-- If the solution isn't plausible, the harness launches aider to try again from scratch.
-The harness alternates between running aider with GPT-4o and Opus up to three times each,
-until it finds a plausible solution.
-- If no plausible solution is found, the harness picks the solution
+Plausibly correct means that aider concluded that it had successfully edited the repo
+without causing syntax errors or breaking any *pre-existing* tests.
+- If the solution isn't plausible, the harness launches aider to try again from scratch
+alternating between using aider with GPT-4o and Opus.
+- If no plausible solution is found after six tries, the harness picks the solution
 with the least amount of edit/lint/test problems.
 
-This is all roughly equivalent to a user:
+It's important to be clear that during benchmarking
+*aider only had access to the pre-existing tests in the repo*.
+It could not see or run the held out "acceptance tests" that are used later to see if the
+SWE Bench problem was correctly resolved.
+
+The benchmarking process can be thought of as similar to a user:
 
 - Launching aider in their repo with the something like command below, which
 tells aider to say yes to every suggestion and use pytest to run tests.
@@ -77,7 +70,7 @@ tells aider to say yes to every suggestion and use pytest to run tests.
 - `/web https://github.com/django/django/issues/XXX`
 - If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again. Maybe with a different LLM this time.
 [Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
-so it's always easy to undo/revert AI changes that don't pan out.
+so it's always easy to revert AI changes that don't pan out.
 
 Of course, outside a benchmark setting it's probably
 unwise to let *any* AI agent run unsupervised on your code base.
@@ -93,7 +86,7 @@ Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
 achieved a score of 25%.
 This was itself a state of the art result, before being surpassed by the main
 result being reported here
-that uses aider with both GPT-4o & Opus.
+that used aider with both GPT-4o & Opus.
 
 ## GPT-4o vs Opus
 
@@ -102,36 +95,36 @@ The harness proceeded in a fixed order, always starting with GPT-4o and
 then alternating with Opus until a plausible solution was found.
 
 The table below breaks down the 79 solutions which were ultimately
-verified as correctly resolving their task.
+verified as correctly resolving their issue.
 Some noteworthy observations:
 
 - Aider with GPT-4o immediately found 77% of the valid solutions on the first attempt.
 - ~90% of valid solutions were found after one attempt from aider with GPT-4o and Opus.
-- A long tail of solutions continued to be found by both models including on the final, 6th attempt.
+- A long tail of solutions continued to be found by both models including one on the final, sixth attempt of that problem.
 
 
-| Attempt | Model | Number<br/>resolved | Percent<br/>of resolved | Cumulative<br/>percent of<br/>resolved |
+| Attempt | Agent | Number<br/>resolved | Percent<br/>of resolved | Cumulative<br/>percent of<br/>resolved |
 |:--------:|------------|---------:|---------:|----:|
-| 1 | GPT-4o | 61 | 77.2 | 77.2
-| 2 | Opus | 10 | 12.7 | 89.9
-| 3 | GPT-4o | 3 | 3.8 | 93.7
-| 4 | Opus | 2 | 2.5 | 96.2
-| 5 | GPT-4o | 2 | 2.5 | 98.7
-| 6 | Opus | 1 | 1.3 | 100.0
+| 1 | Aider with GPT-4o | 61 | 77.2 | 77.2
+| 2 | Aider with Opus | 10 | 12.7 | 89.9
+| 3 | Aider with GPT-4o | 3 | 3.8 | 93.7
+| 4 | Aider with Opus | 2 | 2.5 | 96.2
+| 5 | Aider with GPT-4o | 2 | 2.5 | 98.7
+| 6 | Aider with Opus | 1 | 1.3 | 100.0
 |**Total**| | **79** | **100%** | **100%** |
 
 If we breakdown correct solutions purely by model,
 we can see that GPT-4o dominates.
-This isn't a fair comparison, because GPT-4o always took the first
-attempt at solving.
-But anecdotal evidence from early runs of the benchmark
-supports the observation that GPT-4o is significantly stronger than Opus
+This isn't a fair and direct comparison, because GPT-4o always took the first
+turn at solving.
+But anecdotal evidence from earlier runs of the benchmark
+supports the observation that aider with GPT-4o is significantly stronger than Opus
 for this endeavor.
 
-| Model | Number resolved | Percent of resolved |
+| Agent | Number resolved | Percent of resolved |
 |------------|---------:|---------:|
-| GPT-4o | 66 | 83.5 |
-| Opus | 13 | 16.5 |
+| Aider with GPT-4o | 66 | 83.5 |
+| Aider with Opus | 13 | 16.5 |
 |**Total**| **79** | **100%** |
 
 
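
The retry procedure described in the Methodology hunk (alternate between GPT-4o and Opus, stop at the first plausible solution, otherwise keep the attempt with the fewest edit/lint/test problems after six tries) could be sketched roughly as follows. This is only an illustration and not the actual benchmark harness: the `Attempt` record, the `run_attempt` callback, and the `num_problems` count are hypothetical names introduced here.

```python
from dataclasses import dataclass
from itertools import cycle, islice
from typing import Callable, List, Sequence


@dataclass
class Attempt:
    model: str
    plausible: bool    # hypothetical flag: aider reported clean edits, lint and pre-existing tests
    num_problems: int  # hypothetical count of edit/lint/test problems


def solve(
    run_attempt: Callable[[str], Attempt],
    models: Sequence[str] = ("gpt-4o", "opus"),
    max_tries: int = 6,
) -> Attempt:
    """Alternate between models until an attempt is plausible or tries run out."""
    attempts: List[Attempt] = []
    # cycle() yields gpt-4o, opus, gpt-4o, ... and islice() caps the total tries.
    for model in islice(cycle(models), max_tries):
        attempt = run_attempt(model)  # assumed to start from a fresh copy of the repo
        if attempt.plausible:
            return attempt
        attempts.append(attempt)
    # No plausible solution: fall back to the attempt with the fewest problems.
    return min(attempts, key=lambda a: a.num_problems)
```

Here `run_attempt` stands in for whatever launches aider on the problem statement; passing it in as a callback keeps the loop itself easy to exercise with a stub.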
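
The percentages in the two tables of the last hunk follow directly from the per-attempt counts. A small sketch that recomputes them is shown below; the counts are taken from the diff, while the function names `breakdown` and `by_model` are made up for this example.

```python
from typing import Dict, List, Tuple

# (agent, number resolved) per attempt, as listed in the diff's table
RESOLVED_BY_ATTEMPT: List[Tuple[str, int]] = [
    ("GPT-4o", 61),
    ("Opus", 10),
    ("GPT-4o", 3),
    ("Opus", 2),
    ("GPT-4o", 2),
    ("Opus", 1),
]


def breakdown(rows: List[Tuple[str, int]]) -> List[Tuple[int, str, int, float, float]]:
    """Return (attempt, agent, count, percent, cumulative percent) per row."""
    total = sum(count for _, count in rows)
    running = 0
    table = []
    for attempt, (agent, count) in enumerate(rows, start=1):
        running += count
        table.append(
            (attempt, agent, count,
             round(100 * count / total, 1),
             round(100 * running / total, 1))
        )
    return table


def by_model(rows: List[Tuple[str, int]]) -> Dict[str, Tuple[int, float]]:
    """Return {agent: (count, percent of all resolved)}."""
    total = sum(count for _, count in rows)
    counts: Dict[str, int] = {}
    for agent, count in rows:
        counts[agent] = counts.get(agent, 0) + count
    return {agent: (n, round(100 * n / total, 1)) for agent, n in counts.items()}
```

Calling `breakdown(RESOLVED_BY_ATTEMPT)` gives the per-attempt rows and `by_model(RESOLVED_BY_ATTEMPT)` gives the per-agent totals, both rounded to one decimal place as in the tables.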