- If no plausible solution is found after six tries, the harness picks the solution
with the least amount of edit/lint/test problems.
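
In case it helps to make the retry logic concrete, here is a minimal sketch of
that loop in Python. It is illustrative only: the `Attempt` record and the
caller-supplied `run_aider` function are hypothetical stand-ins, not the
actual harness code.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """Outcome of one aider run (hypothetical structure, for illustration)."""
    edit_errors: int
    lint_errors: int
    test_failures: int

    def is_plausible(self) -> bool:
        # "Plausible" here means clean edits, no lint errors, no failing tests.
        return self.edit_errors == self.lint_errors == self.test_failures == 0

MODELS = ["gpt-4o", "opus"]

def solve(problem, run_aider, max_attempts=6):
    attempts = []
    for i in range(max_attempts):
        model = MODELS[i % len(MODELS)]      # alternate between GPT-4o and Opus
        attempt = run_aider(problem, model)  # hypothetical runner, returns Attempt
        attempts.append(attempt)
        if attempt.is_plausible():
            return attempt
    # No plausible solution after six tries: fall back to the attempt with
    # the fewest combined edit/lint/test problems.
    return min(attempts, key=lambda a: a.edit_errors + a.lint_errors + a.test_failures)
```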

It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
They could not see or run the held out "acceptance tests" that are used
after benchmarking to see if the
SWE Bench problem was correctly resolved.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

- They launched aider in the problem's repo with the command below, which
tells aider they want to accept every suggestion
and to use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try to solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to revert AI changes that don't pan out.
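
To give a feel for how such a session can be driven non-interactively, here is
a small sketch. The `--yes` and `--test-cmd` flags come straight from the
command above; passing the issue text via `--message` is an assumption about
how a scripted session might supply the GitHub issue.

```python
import subprocess

def run_attempt(repo_dir: str, issue_text: str) -> int:
    """Launch one unattended aider attempt in the given repo (sketch only)."""
    proc = subprocess.run(
        ["aider", "--yes", "--test-cmd", "pytest", "--message", issue_text],
        cwd=repo_dir,
    )
    return proc.returncode
```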

Outside a benchmark setting, it's probably
unwise or at least highly inefficient
to let *any* AI agent run unsupervised on your code base.
Aider is intended to be used interactively so that the user
can participate, directing aider's work and approving its suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.

## Aider with GPT-4o alone was SOTA

Running the benchmark harness
using only aider with GPT-4o to find plausible solutions
achieved a score of 25.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.
As noted below, a single attempt using aider with GPT-4o tied
the current top entry on the leaderboard.

## Aider with GPT-4o & Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.

The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- *Just the first attempt* of aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found by both models, including one correctly resolved solution on the final, sixth attempt of that problem.
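
Since all of these percentages are out of the 300 SWE Bench Lite instances,
they can be cross-checked with a little arithmetic (a sanity check derived
from the figures above, not data from the benchmark logs):

```python
total = 300                           # SWE Bench Lite instances
first_attempt = round(0.203 * total)  # 61 problems resolved on attempt 1
first_two     = round(0.236 * total)  # 71 problems resolved by attempt 2
all_resolved  = 79                    # solutions verified as correct

print(first_two / all_resolved)       # ~0.90, i.e. "~90% of all resolved"
print(all_resolved / total)           # ~0.263, the overall resolve rate
```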

Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created from the code's
abstract syntax tree and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly tailored to show
the repo context that is most relevant to the current chat conversation.
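
As a rough illustration of what a repo map entry can look like, here is a toy
version built with Python's `ast` module. Aider's real map is far more
sophisticated, so treat this purely as a sketch of the idea:

```python
import ast

def map_file(path: str) -> str:
    """Summarize one file as top-level class/function signatures (toy version)."""
    with open(path) as f:
        tree = ast.parse(f.read())
    lines = [f"{path}:"]
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"    def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"    class {node.name}:")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"        def {item.name}(...)")
    return "\n".join(lines)
```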

When aider completes, it returns an editing outcome that indicates
whether it was able to successfully complete all edits.
The benchmark harness used this editing status as
one criterion to determine if aider has
created a plausible solution.

## Linting and fixing

Aider lints the code
after every LLM edit and offers to automatically fix
any problems.
Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree to display relevant code context for each
error.
This context increases the ability of the LLM to understand the problem and
make the correct changes to resolve it.
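
One way to picture this is the sketch below, which uses Python's `ast` module
to find the smallest function or class that encloses an error line and show
that whole block alongside the error. This is only an approximation of the
idea; aider's actual format differs:

```python
import ast

def lint_context(source: str, err_line: int, err_msg: str) -> str:
    """Show a lint error with its smallest enclosing def/class (sketch only)."""
    tree = ast.parse(source)  # assumes the file still parses
    best = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            end = node.end_lineno or node.lineno
            if node.lineno <= err_line <= end:
                if best is None or end - node.lineno < best_span:
                    best, best_span = node, end - node.lineno
    if best is None:
        return f"line {err_line}: {err_msg}"
    block = source.splitlines()[best.lineno - 1 : best.end_lineno]
    return f"line {err_line}: {err_msg}\n" + "\n".join(block)
```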

Aider also reports a lint outcome that
indicates if it was able to ultimately produce
code without any outstanding linting errors.
The benchmark harness used this status as
one of the criteria to determine if aider has
created a plausible solution.

## Testing and fixing

The benchmark harness produces a candidate solution for each of the 300
SWE Bench Lite instances and saves it as the `model_patch`.

A separate evaluation script
tests each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
are discarded.
This ensures that the full, correct test suite is used for acceptance testing.
The evaluation script compares the test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they match, the candidate solution has correctly resolved the issue.
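
Under the assumption of a straightforward git/pytest setup, the flow might
look something like the sketch below. The paths and helpers are illustrative,
not the actual SWE Bench evaluation script:

```python
import subprocess

def run_checked(repo_dir, *cmd):
    """Run a command in the repo, raising if it fails."""
    subprocess.run(cmd, cwd=repo_dir, check=True)

def evaluate(repo_dir, model_patch, test_patch):
    # Apply the candidate solution saved by the benchmark harness.
    run_checked(repo_dir, "git", "apply", model_patch)
    # Discard any edits aider made to tests, restoring the original suite
    # (assumes tests live under tests/, an illustrative simplification).
    run_checked(repo_dir, "git", "checkout", "--", "tests/")
    # Add the held out acceptance tests, then run the full suite.
    run_checked(repo_dir, "git", "apply", test_patch)
    return subprocess.run(["python", "-m", "pytest"], cwd=repo_dir).returncode

# Running evaluate() against the "gold" patch instead of the model_patch
# gives the reference outcome that a correct solution must match.
```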

These so-called `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved problems.