Paul Gauthier 2024-05-31 11:23:13 -07:00
parent 38ac9de678
commit 7fe697c1df


@@ -13,6 +13,8 @@ on the main
achieving a state-of-the-art result.
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
This is in addition to
[aider's SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html)
@@ -48,17 +50,17 @@ avoid wasting time and token costs.
## Benchmark methodology
Benchmarking was conducted as follows:

- Aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
- After that aider ran as normal, except all of aider's
suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o isn't plausible, the harness launches aider to try again from scratch,
this time using Claude 3 Opus.
- If no plausible solution is found after those two tries, the harness picks the "most plausible" solution with the fewest edit/lint/test problems (see the sketch below).
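
To make that retry flow concrete, here is a minimal sketch of the harness logic in Python. The helper names (`run_aider`, `is_plausible`, `problem_count`) and the record layout are illustrative assumptions, not aider's actual benchmark code.

```python
# Hypothetical sketch of the two-try harness described above; run_aider,
# is_plausible and problem_count are assumed helpers, not real aider APIs.

def solve_with_retries(problem):
    """Try GPT-4o first, then Claude 3 Opus, keeping the best candidate."""
    attempts = []
    for model in ["gpt-4o", "claude-3-opus"]:
        # Launch aider in the problem's repo with the issue text as the
        # opening chat message, auto-accepting all of aider's suggestions.
        result = run_aider(repo=problem.repo, message=problem.statement, model=model)
        attempts.append(result)
        # "Plausible" here means aider edited the repo without syntax errors
        # and without breaking any pre-existing tests.
        if is_plausible(result):
            return result
    # No plausible solution after two tries: keep the "most plausible"
    # attempt, i.e. the one with the fewest edit/lint/test problems.
    return min(attempts, key=problem_count)
```
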
@@ -71,7 +73,8 @@ correctly resolved.
This is the same methodology
that was used for [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
The only difference is that for this result
at most two tries were attempted instead of six,
due to the increased token costs involved in this benchmark.
The SWE Bench problems are more difficult and involve edits to
more than one source file,
@@ -132,45 +135,83 @@ verified as correctly resolving their issue.
| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
## Non-plausible but correct solutions?

It's worth noting that the first row of the table above
only scored 15.3% on the benchmark,
which differs from the 17.0% result reported above for aider with just GPT-4o.
This is because making additional attempts is not guaranteed to
monotonically increase the number of resolved issues.
Later attempts may propose solutions which
seem "more plausible" than prior attempts,
but which are actually worse solutions.
Luckily the later attempts usually provide a net increase in the overall
number of resolved solutions, as is the case here.

This table breaks down the plausibility of each solution proposed by
aider with GPT-4o and with Opus, as well as whether it was actually
a correct solution.

|Row|GPT-4o<br>solution<br>plausible?|GPT-4o<br>solution<br>resolved issue?|Opus<br>solution<br>plausible?|Opus<br>solution<br>resolved issue?|Count|
|---:|--:|--:|--:|--:|--:|
| 1 | plausible | resolved | n/a | n/a | 73 |
| 2 | plausible | not resolved | n/a | n/a | 181 |
| 3 | non-plausible | resolved | plausible | resolved | 1 |
| 4 | non-plausible | resolved | plausible | not resolved | 2 |
| 5 | non-plausible | resolved | non-plausible | resolved | 16 |
| 6 | non-plausible | resolved | non-plausible | not resolved | 5 |
| 7 | non-plausible | not resolved | plausible | resolved | 12 |
| 8 | non-plausible | not resolved | plausible | not resolved | 53 |
| 9 | non-plausible | not resolved | non-plausible | resolved | 4 |
| 10 | non-plausible | not resolved | non-plausible | not resolved | 216 |
| 11 | non-plausible | not resolved | n/a | n/a | 7 |
Rows 1-2 show the case where the first solution found
by aider with GPT-4o was plausible. Of those, 73 went on to be deemed as resolving the issue,
while 181 were not in fact correct solutions. Opus never got a try
at solving these problems, because the harness stopped once a
plausible solution was found.
The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so Opus had a turn to try and solve.
Rows 3-6 are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight,
but in row 4 we can see that aider with Opus overrode
2 of them with plausible-but-incorrect
solutions.
The original correct solutions from GPT-4o may not have been
plausible because of pre-existing or otherwise
unresolved editing, linting or testing errors which were unrelated
to the SWE Bench issue or which turned out to be non-fatal.
In rows 5-6 & 9-10 we can see that both GPT-4o and Opus
produced non-plausible solutions,
and which one was selected came down to the details of
which solution the harness considered "most plausible".
Row 11 contains cases where Opus returned errors due to context window
exhaustion or other problems.
In these cases aider with Opus was unable to produce any solutions.
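
For readers who want to reproduce this kind of breakdown from their own benchmark logs, here is a rough sketch of the tally. The per-problem record layout is a hypothetical assumption, not the harness's actual data format.

```python
# Hypothetical per-problem records: a GPT-4o attempt, plus an Opus attempt
# only when the GPT-4o solution wasn't plausible and Opus didn't error out.

def categorize(problem):
    """Map one benchmark problem onto a row of the breakdown table above."""
    gpt4o = problem["gpt4o"]
    opus = problem.get("opus")  # None covers rows 1-2 and row 11

    def label(attempt):
        return ("plausible" if attempt["plausible"] else "non-plausible",
                "resolved" if attempt["resolved"] else "not resolved")

    if opus is None:
        return (*label(gpt4o), "n/a", "n/a")
    return (*label(gpt4o), *label(opus))

# Tallying the rows, e.g.:
#   from collections import Counter
#   counts = Counter(categorize(p) for p in problems)
```
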
## Computing the benchmark score

Benchmarking produced one candidate solution for each of
the 570 SWE Bench problems.
A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each candidate solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the candidate solution correctly resolved the issue.
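
As a rough sketch of that comparison, assuming hypothetical `discard_test_edits` and `run_full_test_suite` helpers and a simple test-report equality check (the real SWE Bench evaluation is more involved):

```python
# Illustrative only: discard_test_edits and run_full_test_suite are assumed
# helpers, not part of aider or the official SWE Bench evaluation code.

def resolves_issue(repo, candidate_patch, gold_patch):
    """Accept a candidate only if its test outcomes match the gold patch's."""
    # Drop any edits the candidate made to test files, so the original,
    # unmodified test suite (including held-out acceptance tests) is what runs.
    cleaned = discard_test_edits(candidate_patch)
    candidate_report = run_full_test_suite(repo, cleaned)
    gold_report = run_full_test_suite(repo, gold_patch)
    # The issue counts as resolved only when the candidate produces the same
    # pass/fail outcomes as the human-written "gold" patch.
    return candidate_report == gold_report
```
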
These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to solve the problems.