This is the same approach
that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
For the Lite benchmark,
aider alternated between GPT-4o and Opus for up to 6 total attempts.
Due to the increased token costs involved in running
the main SWE Bench benchmark, aider was limited to 2 total attempts:
one attempt of aider with GPT-4o and one with Opus.

The problems from the main SWE Bench dataset
are more difficult and involved edits to
multiple source files,
which increased the token costs as compared to Lite.
Further, aider was benchmarked on 570 SWE Bench problems
versus only 300 Lite problems,
adding another factor of ~two to the costs.

For a detailed discussion of the benchmark
methodology, please see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
Also, the
[aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
contains the harness and statistics code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

## Aider with GPT-4o & Opus

The benchmark harness started by using aider with GPT-4o to try
and solve each problem.
For problems where this didn't produce a plausible solution,
the harness tried again using aider with Opus.
So at most, two attempts were made for each problem.
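
To make that flow concrete, here is a minimal sketch of the two-attempt logic.
The helper names below (`run_aider`, `is_plausible`, `most_plausible`) are
placeholders invented for this sketch, not the actual harness API; the real
harness code lives in the aider-swe-bench repository linked above.

```python
from typing import Callable, List

def solve_problem(
    problem: dict,
    run_aider: Callable[[dict, str], dict],        # placeholder: run aider on a problem with one model
    is_plausible: Callable[[dict], bool],          # placeholder: the harness's plausibility check
    most_plausible: Callable[[List[dict]], dict],  # placeholder: tie-break among non-plausible candidates
) -> dict:
    """Sketch of the two-attempt strategy: GPT-4o first, then Opus only if needed."""
    candidates = []
    for model in ("gpt-4o", "claude-3-opus"):      # at most one attempt per model
        solution = run_aider(problem, model)
        candidates.append(solution)
        if is_plausible(solution):                 # accept the first plausible solution
            return solution
    # Neither attempt was plausible: keep the "most plausible"
    # candidate as the proposed solution anyway.
    return most_plausible(candidates)
```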

The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.

Some tests may fail during acceptance testing,
and that's ok as long as they failed for the gold
patch too.
- There may have been pre-existing linting problems in the repo.
If lingering linting issues affected code paths that are not well tested,
they may not impact acceptance testing.
- Aider may have reported file editing errors because it thought the LLM
specified edits that it wasn't able to successfully apply.
This can only happen when the LLM specified edits in
a way that doesn't comply with the editing instructions in the system prompt.
Given that the LLM isn't complying with the system prompt,
it may have become confused and
asked for redundant or otherwise irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.
- Etc.

Keeping all this in mind, we can understand why
GPT-4o accounts for 15.3% of the benchmark score in the table above,
but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
When an Opus attempt is allowed after GPT-4o,
it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.

For these reasons, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
New solutions may solve some new problems but they may also
eclipse and discard some of the previous non-plausible correct solutions.
Luckily, additional attempts usually provide a net increase in the overall
number of resolved problems.
This was the case for both this main SWE Bench result and the
earlier Lite result.

The table below breaks down the benchmark outcome of each problem,
showing whether aider with GPT-4o and with Opus
produced plausible and/or correct solutions.

|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|
|:--:|--:|--:|--:|--:|--:|
| A | plausible | resolved | n/a | n/a | 73 |
| B | plausible | not resolved | n/a | n/a | 181 |
| I | non-plausible | not resolved | plausible | resolved | 12 |
| J | non-plausible | not resolved | plausible | not resolved | 53 |
| K | non-plausible | not resolved | n/a | n/a | 7 |
|Total|||||570|

Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.

The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so Opus got a turn to try and solve.

Rows C-F are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight.
In row D we can see the cases where aider with Opus definitely overrides
2 of them with plausible-but-incorrect solutions.

In rows E-H we can see that both GPT-4o and Opus
produced non-plausible solutions.
Which one was ultimately selected for each problem depends on
[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
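
The linked article spells out the actual selection criteria.
As a purely hypothetical illustration of the idea, a harness could score each
candidate on a few boolean checks, similar to the signals discussed earlier
(edit errors, lint errors, test failures), and keep the highest scorer; the
field names below are invented for this sketch, not the real harness data.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    # Invented flags for illustration; the real harness tracks its own checks.
    edits_applied: bool   # aider applied its edits without errors
    lint_clean: bool      # no unresolved lint errors
    tests_passed: bool    # no unresolved test failures

def plausibility_score(c: Candidate) -> int:
    """Count how many checks a candidate passes (illustrative only)."""
    return sum([c.edits_applied, c.lint_clean, c.tests_passed])

def pick_most_plausible(candidates: List[Candidate]) -> Candidate:
    # Among non-plausible candidates, keep the one that passes the most checks.
    return max(candidates, key=plausibility_score)
```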

Rows I-J consider the straightforward cases where aider with GPT-4o
didn't find a plausible solution but Opus did.
Of these, Opus' solution went on to be deemed correct for 12 problems
and incorrect for 53.
In the 7 cases shown in row K, aider with Opus was unable to produce any solutions.

## Computing the benchmark score

Benchmarking produced one proposed solution for each of
the 570 SWE Bench problems.

A separate evaluation script was used to
test each of these proposed solutions against the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.

The evaluation script compared each proposed solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the proposed solution correctly resolved the issue.
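
As a rough sketch of that comparison step, assume each test run is summarized
as a mapping from test name to outcome; this format is an assumption made for
illustration, not the actual SWE Bench evaluation schema.

```python
from typing import Dict

def resolves_issue(candidate: Dict[str, str], gold: Dict[str, str]) -> bool:
    """
    Compare a proposed solution's test outcomes against the gold patch's outcomes.
    Outcomes are strings like "passed"/"failed"; a test that fails for both the
    candidate and the gold patch still counts as a match.
    """
    if candidate.keys() != gold.keys():
        return False  # a different set of tests ran; treat as not resolved
    return all(candidate[name] == gold[name] for name in gold)
```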

These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved problems.

Table 2 of their
[paper](https://arxiv.org/pdf/2404.05427v2)
reports an `ACR-avg` result of 10.59%, which is an average pass@1 result.
The results presented here for aider are all pass@1, as
the [official SWE Bench Lite leaderboard](https://www.swebench.com)
only accepts pass@1 results.