From 6094104b6cca9ca90e27e0ca36e6c643f8cc27dc Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Sat, 1 Jun 2024 05:54:11 -0700
Subject: [PATCH] copy

---
 _posts/2024-05-31-both-swe-bench.md | 77 +++++++++++++++--------------
 1 file changed, 41 insertions(+), 36 deletions(-)

diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index 809fb6d52..925f64a5f 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -75,22 +75,26 @@ correctly resolved.
This is the same approach that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).

-Aider alternated between GPT-4o and Opus for up to 6 total attempts
-on the Lite benchmark.
+For the Lite benchmark,
+aider alternated between GPT-4o and Opus for up to 6 total attempts.
Due to the increased token costs involved in running
-the main SWE Bench benchmark, aider was limited to 2 total attempts.
-Problems from the main SWE Bench dataset
-are more difficult and involve edits to
-more than one source file,
-which increased the token costs of solving each problem.
+the main SWE Bench benchmark, aider was limited to 2 total attempts:
+one attempt of aider with GPT-4o and one with Opus.
+
+The problems from the main SWE Bench dataset
+are more difficult and involve edits to
+multiple source files,
+which increased the token costs compared to Lite.
Further, aider was benchmarked on 570 SWE Bench problems
versus only 300 Lite problems,
adding another factor of ~two to the costs.

-For a detailed discussion of the methodology, please see the
+For a detailed discussion of the benchmark
+methodology, please see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
-The [aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench) also contains
-the harness and analysis code used for the benchmarks.
+Also, the
+[aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
+contains the harness and statistics code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:
@@ -115,10 +119,11 @@ that used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

-The benchmark harness ran aider with GPT-4o to try
-and solve the problem. If a plausible solution wasn't found,
-it ran aider with Opus
-to try and solve the problem.
+The benchmark harness started by using aider with GPT-4o to try
+and solve each problem.
+For problems where this didn't produce a plausible solution,
+the harness tried again using aider with Opus.
+So at most two attempts were made for each problem.

The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.
@@ -160,24 +165,21 @@ Some tests may fail during acceptance testing,
and that's ok as long as they failed for the gold patch too.

- There may have been pre-existing linting problems in the repo.
-If they were in code paths that are irrelevant to the problem being solved,
-then aider's failure to resolve them might not affect acceptance testing.
+If lingering linting issues affect code paths that are not well tested,
+they may not impact acceptance testing.

- Aider may have reported file editing errors because it thought the LLM
specified edits that it wasn't able to successfully apply.
-In such a scenario, the LLM must have specified edits in
-a way that doesn't comply with the edit format
-specified in its system prompt.
-Aider tries hard to deal with non-compliant LLM edits,
-but still sometimes fails.
-So the LLM may have become confused and
+This can only happen when the LLM specified edits in
+a way that doesn't comply with the editing instructions in the system prompt.
+Given that the LLM wasn't complying with the system prompt,
+it may have become confused and
asked for redundant or otherwise irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.

- Etc.

Keeping all this in mind, we can understand why
GPT-4o accounts for 15.3% of the benchmark score in the table above,
-but we reported that
-just one attempt of aider with GPT-4o scored 17.0%.
+but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
When an Opus attempt is allowed after GPT-4o,
it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.
@@ -190,18 +192,18 @@ as compared to the results from just one try using aider with GPT-4o (17.0%).

For these reasons, adding additional attempts is not guaranteed
to monotonically increase the number of resolved problems.
-The new solutions may solve some new problems but they may also
+New solutions may solve some new problems, but they may also
eclipse and discard some of the previous non-plausible correct solutions.

Luckily, additional attempts usually provide a net increase in the overall
number of resolved solutions.
This was the case for both this main SWE Bench result and the
earlier Lite result.

-The table below breaks down the plausibility of each solution proposed by
-aider with GPT-4o and with Opus, and indicates which were actually
-correct solutions.
+The table below breaks down the benchmark outcome of each problem,
+showing whether aider with GPT-4o and with Opus
+produced plausible and/or correct solutions.

-|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Count|
+|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|
|:--:|--:|--:|--:|--:|--:|
| A | plausible | resolved | n/a | n/a | 73 |
| B | plausible | not resolved | n/a | n/a | 181 |
@@ -214,6 +216,7 @@ correct solutions.
| I | non-plausible | not resolved | plausible | resolved | 12 |
| J | non-plausible | not resolved | plausible | not resolved | 53 |
| K | non-plausible | not resolved | n/a | n/a | 7 |
+|Total|||||570|

Rows A-B show the cases where aider with GPT-4o found
a plausible solution during the first attempt.
@@ -227,16 +230,17 @@ The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution,
so Opus got a turn to try and solve.
Rows C-F are cases where GPT-4o's non-plausible solutions
were actually found to be correct in hindsight.
-In row D we can see the cases where aider with Opus overrides
+In row D we can see the cases where aider with Opus
+definitely overrode
2 of them with plausible-but-incorrect solutions.

In rows E-H we can see that both GPT-4o and Opus
produced non-plausible solutions.
-Which one was ultimately selected has to do with the
+Which one was ultimately selected for each problem depends on
[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).

-Rows I-J consider the simple cases where aider with GPT-4o
+Rows I-J consider the straightforward cases where aider with GPT-4o
didn't find a plausible solution but Opus did.
Of these, Opus' solution went on to be deemed correct for 12
problems and incorrect for 53.

@@ -247,7 +251,7 @@ In these cases aider with Opus was unable to produce any solutions.

## Computing the benchmark score

-Benchmarking produced one candidate solution for each of
+Benchmarking produced one proposed solution for each of
the 570 SWE Bench problems.

A separate evaluation script was used to
@@ -257,10 +261,10 @@ For this final acceptance testing, any edits that aider made to
tests were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.

-The evaluation script compared each candidate solution's test results
+The evaluation script compared each proposed solution's test results
with results from testing the "gold" patch that was developed by a human to
correctly solve the issue.
-If they matched, the candidate solution correctly resolved the issue.
+If they matched, the proposed solution correctly resolved the issue.

These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
@@ -299,7 +303,8 @@ Table 2 of their
[paper](https://arxiv.org/pdf/2404.05427v2)
reports an `ACR-avg` result of 10.59% which is an average pass@1 result.
-The [official SWE Bench Lite leaderboard](https://www.swebench.com)
+The results presented here for aider are all pass@1, as
+the [official SWE Bench Lite leaderboard](https://www.swebench.com)
only accepts pass@1 results.
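
To tie the pieces above together, here is a rough, self-contained Python sketch of the flow described in this post: one GPT-4o attempt, an Opus fallback only when no plausible solution was found, and a single proposed solution per problem judged against the gold patch's tests. This is not the code from the aider-swe-bench harness; `Solution`, `run_aider`, and the random stubs are hypothetical placeholders standing in for the real attempt, plausibility, and acceptance-testing steps.

```python
from dataclasses import dataclass
import random

@dataclass
class Solution:
    model: str
    plausible: bool
    passes_gold_tests: bool

def run_aider(problem: str, model: str) -> Solution:
    # Placeholder stub: the real harness launches aider against the
    # problem's repo and then inspects the outcome; random values here
    # only keep the sketch self-contained and runnable.
    return Solution(model=model,
                    plausible=random.random() < 0.5,
                    passes_gold_tests=random.random() < 0.2)

def solve_one_problem(problem: str) -> Solution:
    # Try aider with GPT-4o first; fall back to Opus only if the first
    # attempt did not yield a plausible solution.
    attempts = []
    for model in ("gpt-4o", "claude-3-opus"):
        solution = run_aider(problem, model)
        attempts.append(solution)
        if solution.plausible:
            return solution
    # Neither attempt was plausible: keep one proposal anyway (the real
    # harness picks whichever proposal it judges "most plausible").
    return attempts[-1]

def benchmark(problems: list[str]) -> float:
    # Exactly one proposed solution per problem is scored.
    # Acceptance testing happens separately, against the unmodified test
    # suite, by comparing test results with the human-written gold patch.
    proposed = [solve_one_problem(p) for p in problems]
    resolved = sum(s.passes_gold_tests for s in proposed)
    return resolved / len(problems)

if __name__ == "__main__":
    print(f"resolved: {benchmark([f'problem-{i}' for i in range(570)]):.1%}")
```

Only one final proposed solution per problem is ever scored, which is why the combined GPT-4o & Opus result is still reported as pass@1.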