- If no plausible solution is found after six tries, the harness picks the solution
with the least amount of edit/lint/test problems.
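
This fallback is easy to picture as a scoring pass over the recorded attempts. Below is a minimal sketch, assuming each attempt records counts of edit, lint, and test problems; the `Attempt` fields and `pick_fallback` name are invented for illustration and are not the harness's actual code:

```python
# Minimal sketch of the fallback selection; names are invented, not the harness's code.
from dataclasses import dataclass

@dataclass
class Attempt:
    model_patch: str    # the diff this attempt produced
    edit_problems: int  # edits that could not be applied cleanly
    lint_problems: int  # outstanding lint errors
    test_problems: int  # failing pre-existing tests

def pick_fallback(attempts: list[Attempt]) -> Attempt:
    """After six tries with no plausible solution, keep the attempt
    with the fewest combined edit/lint/test problems."""
    return min(
        attempts,
        key=lambda a: a.edit_problems + a.lint_problems + a.test_problems,
    )
```
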
It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
They could not see or run the held out "acceptance tests" that are used
after benchmarking to see if the
SWE Bench problem was correctly resolved.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

- They could launch aider in their repo with a command that
tells aider they want to accept every suggestion
and to use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try and solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to revert AI changes that don't pan out.

Outside a benchmark setting, it's probably
unwise or at least highly inefficient
to let *any* AI agent run unsupervised on your code base.
The reason aider is intended to be used interactively
is so that the user can participate, direct aider's work, and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.

## Aider with GPT-4o alone was SOTA

Running the benchmark harness
only using aider with GPT-4o to find plausible solutions
achieved a score of 25.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.

As noted below, a single attempt using Aider with GPT-4o tied
the current top entry on the leaderboard.

## Aider with GPT-4o & Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.

The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found by both models, including one correctly resolved solution on the final, sixth attempt at that problem.

Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created from the code's
abstract syntax tree and call graph
to provide a compact and powerful summary of the entire code base.
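
As a toy illustration of the idea only, the sketch below walks one Python file's abstract syntax tree and collects the kind of structural outline a repo map surfaces; aider's real map covers many languages and ranks which symbols to show, which this sketch does not do:

```python
# Toy illustration of a repo-map-style outline, not aider's implementation.
import ast
from pathlib import Path

def outline_file(path: str) -> list[str]:
    """Collect the class and function definitions in one Python file,
    the kind of structural outline a repo map gives the LLM."""
    tree = ast.parse(Path(path).read_text())
    outline = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            outline.append(f"class {node.name}:")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            outline.append(f"def {node.name}({args}): ...")
    return outline
```
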
The map is constantly
tailored to show

When aider completes, it returns an editing outcome that indicates
whether it was able to successfully complete all edits.
The benchmark harness used this editing status as
one criterion to determine if aider has
created a plausible solution.
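
In other words, the harness treated plausibility as the conjunction of this editing status with the lint and test outcomes described in the next sections. A minimal sketch, with invented field and function names rather than the harness's actual code:

```python
# Illustrative only; field and function names are invented.
from dataclasses import dataclass

@dataclass
class Outcome:
    edits_ok: bool  # aider applied all of its edits successfully
    lint_ok: bool   # no outstanding lint errors (next section)
    tests_ok: bool  # the repo's pre-existing tests pass (later section)

def is_plausible(o: Outcome) -> bool:
    # The editing status is one of the three criteria.
    return o.edits_ok and o.lint_ok and o.tests_ok
```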

## Linting and fixing

Aider lints code
after every LLM edit and offers to automatically fix
any problems.

Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree to display relevant code context for each
error.
This context increases the ability of the LLM to understand the problem and
make the correct changes to resolve it.
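
As a rough sketch of the underlying idea, the example below uses Python's `ast` module to find the innermost function or class that encloses an error line; this is illustrative only, and aider's actual report format and tooling differ:

```python
# Rough sketch of attaching AST-derived context to a lint error.
# Illustrative only; aider's actual report format and tooling differ.
import ast

def context_for_error(source: str, error_line: int) -> str:
    """Return the innermost function or class enclosing error_line,
    so the LLM sees the whole relevant scope, not just one line."""
    tree = ast.parse(source)
    best = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.lineno <= error_line <= (node.end_lineno or node.lineno):
                # A later start line means a more deeply nested scope.
                if best is None or node.lineno >= best.lineno:
                    best = node
    return ast.get_source_segment(source, best) if best else source
```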

Aider returns a lint outcome that
indicates if it was able to ultimately produce
code without any outstanding linting errors.
The benchmark harness used this status as
one of the criteria to determine if aider has
created a plausible solution.

## Testing and fixing

The benchmark harness produces a candidate solution for each of the 300
SWE Bench Lite instances and saves it as the `model_patch`.

A separate evaluation script
tests each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
are discarded.
This ensures that the full, correct test suite is used for acceptance testing.
The evaluation script compares the test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they match, the candidate solution has correctly resolved the issue.
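
In outline, those two steps might look like the sketch below. Everything here is invented for illustration, not the real SWE Bench evaluation code: `strip_test_edits` is a crude stand-in for discarding aider's test edits, and the result dictionaries stand in for full test-suite runs:

```python
# Illustrative sketch only; names are invented, not the real
# SWE Bench evaluation code.

def strip_test_edits(model_patch: str) -> str:
    """Drop diff sections that touch test files, so acceptance testing
    always runs against the original, unmodified test suite."""
    kept, keep = [], True
    for line in model_patch.splitlines(keepends=True):
        if line.startswith("diff --git"):
            # Crude heuristic for "this section edits a test file".
            keep = "test" not in line
        if keep:
            kept.append(line)
    return "".join(kept)

def resolved(candidate_results: dict[str, bool],
             gold_results: dict[str, bool]) -> bool:
    """Both dicts map test ids to pass/fail from running the full suite,
    including the held out acceptance tests. A candidate resolves the
    issue only if its results match the gold patch's results."""
    return candidate_results == gold_results
```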

These so-called `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of