diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index 78ce77de6..ffa077294 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -16,9 +16,9 @@ from Amazon Q Developer Agent.
 The best result reported elsewhere seems to be
 [13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
 
-This is in addition to
+This result on the main SWE Bench is in addition to
 [aider's SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html)
-that was reported last week.
+that was reported recently.
 
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
 
@@ -57,11 +57,10 @@ with the problem statement
 submitted as the opening chat message from "the user".
 - After that aider ran as normal, except all of aider's
 suggestions were always accepted without user approval.
-- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
+- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
 Plausibly correct means that aider reported that it had successfully edited the repo
 without causing syntax errors or breaking any *pre-existing* tests.
-- If the solution from aider with GPT-4o isn't plausible, the harness launches aider to try again from scratch,
-this time using Claude 3 Opus.
+- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch, this time using Claude 3 Opus.
 - If no plausible solution is found after those two tries, the harness picks the "most plausible"
 solution with the fewest edit/lint/test problems.
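The retry scheme described in the bullet list above amounts to a small control loop: run aider once per model, stop at the first plausible solution, and otherwise fall back to the least-broken attempt. The sketch below is illustrative only; `run_attempt`, `is_plausible` and `problem_count` are hypothetical stand-ins, not functions from the aider-swe-bench harness.

```python
def solve_with_retries(problem, run_attempt, is_plausible, problem_count,
                       models=("gpt-4o", "claude-3-opus")):
    """Try each model in turn, keeping the first plausible solution."""
    attempts = []
    for model in models:
        result = run_attempt(problem, model)  # one aider run, started from scratch
        attempts.append(result)
        if is_plausible(result):  # clean edits, no lint errors, pre-existing tests pass
            return result
    # No plausible solution after both tries: fall back to the attempt
    # with the fewest outstanding edit/lint/test problems.
    return min(attempts, key=problem_count)
```

Stopping at the first plausible result means the second model is only invoked on problems the first one could not plausibly solve, which keeps token costs down.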
 
 It's important to be clear that
@@ -73,20 +72,22 @@ correctly resolved.
 
 This is the same methodology that was used for
 [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
-The only difference is that for this result
-at most two tries were attempted instead of six,
-due to the increased token costs involved in this benchmark.
-The SWE Bench problems are more difficult and involve edits to
+Aider alternated between GPT-4o and Opus for up to 6 total attempts
+on the Lite benchmark.
+Due to the increased token costs involved in running
+the main SWE Bench benchmark, aider was limited to 2 total attempts.
+Problems from the main SWE Bench dataset
+are more difficult and involve edits to
 more than one source file,
-which increased the cost of solving each problem.
-Further, aider was benchmarked on 570 SWE Bench problems,
+which increased the token costs of solving each problem.
+Further, aider was benchmarked on 570 SWE Bench problems
 versus only 300 Lite problems, adding another factor of ~two to the costs.
 
 For a detailed discussion of the methodology, please see the
 [article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
 The [aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench) also contains
-the harness and reporting code used for the benchmarks.
+the harness and analysis code used for the benchmarks.
 
 The benchmarking process was similar to how a developer might use aider to
 resolve a GitHub issue:
 
@@ -103,8 +104,7 @@ so it's always easy to revert AI changes that don't pan out.
 
 ## Aider with GPT-4o alone was SOTA
 
-Running the benchmark harness
-only using aider with GPT-4o to find plausible solutions with a single attempt
+Using aider with GPT-4o to make a single attempt at solving each problem
 achieved a score of 17.0%.
 This was itself a state-of-the-art result, before being surpassed by the main
 result being reported here
@@ -112,13 +112,13 @@ that used aider with both GPT-4o & Opus.
 
 ## Aider with GPT-4o & Opus
 
-The benchmark harness started by running aider with GPT-4o once to try
+The benchmark harness ran aider with GPT-4o to try
 and solve the problem. If
-no plausible solution was found, it then used aider with Opus
-once to try and solve the problem.
+no plausible solution was found, it ran aider with Opus
+to try and solve the problem.
 
 The table below breaks down the proposed solutions that
-were found for the 570 problems.
+were found from each attempt for the 570 problems.
 A proposed solution is either:
 
 - A plausible solution where
@@ -137,22 +137,55 @@ verified as correctly resolving their issue.
 
 ## Non-plausible but correct solutions?
 
-It's worth noting that the first row of the table above
-only scored 15.3% on the benchmark,
-which differs from the 17.0% result reported above for aider with just GPT-4o.
-This is because making additional attempts is not guaranteed to
-monotonically increase the number of resolved issues.
-Later attempts may propose solutions which
-seem "more plausible" than prior attempts,
-but which are actually worse solutions.
-Luckily the later attempts usually provide a net increase in the overall
+A solution doesn't have to be plausible in order to correctly resolve the issue.
+Recall that plausible is simply defined as aider
+reporting that it successfully edited files,
+repaired and resolved any linting errors,
+and repaired tests so that they all passed.
+But there are lots of reasons why aider might fail to do those things
+and yet the solution is still a correct solution that will pass
+acceptance testing:
+
+- There could be pre-existing failing tests in the repo,
+before aider even starts working on the SWE Bench problem.
+Aider may not resolve such issues, and yet they may turn out not to be
+relevant to the acceptance testing.
+The SWE Bench acceptance testing just confirms that tests pass or fail
+in the same pattern as the "gold patch" developed by a human to solve the
+problem.
+Some tests may still fail, and that's ok as long as they fail for the gold
+patch too.
+- There could be pre-existing linting problems in the repo,
+which are in code paths that are irrelevant to the problem being solved
+and to acceptance testing.
+If aider is unable to resolve them, the solution may still be valid
+and pass acceptance testing.
+- Aider may report editing errors because it doesn't think it was
+able to successfully apply all the edits the LLM specified.
+In this scenario, the LLM has specified edits in an invalid
+format that doesn't comply with its
+system prompt instructions.
+So it may be that the LLM was asking for redundant or otherwise
+irrelevant edits, such that outstanding edit errors are actually not fatal.
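The "plausible" check that the bullets above pick apart reduces to three conditions reported by aider. Here is a rough sketch of that predicate; the field names are assumptions for illustration, not the actual aider-swe-bench data model.

```python
from dataclasses import dataclass


@dataclass
class AttemptReport:
    edits_applied_cleanly: bool  # every edit the LLM requested was applied
    lint_errors: int             # lint/syntax problems left after aider's repair attempts
    failing_tests: int           # pre-existing tests still failing after repair attempts


def is_plausible(report: AttemptReport) -> bool:
    """Plausible = clean edits, no outstanding lint errors, all pre-existing tests pass."""
    return (report.edits_applied_cleanly
            and report.lint_errors == 0
            and report.failing_tests == 0)
```

As the bullets note, each of these signals can be noisy: a repo can ship with failing tests or lint problems, and edit errors can come from redundant edits, so a non-plausible attempt can still pass SWE Bench acceptance testing.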
+
+This is why the first row in the table above
+shows GPT-4o accounting for 15.3% of the benchmark score,
+which is different from the 17.0% result reported earlier
+for aider with just GPT-4o.
+The second attempt from Opus may propose solutions which
+are "more plausible" than some of GPT-4o's non-plausible solutions,
+but which are actually incorrect solutions.
+These more plausible but incorrect solutions can
+eclipse the earlier non-plausible correct
+solution.
+Luckily the full set of later attempts usually provides a net increase in the overall
 number of resolved solutions, as is the case here.
 
-This table breaks down the plausibility of each solution proposed by
-aider with GPT-4o and with Opus, as well as whether it was actually
-a correct solution.
+The table below breaks down the plausibility of each solution proposed by
+aider with GPT-4o and with Opus, and indicates which were actually
+correct solutions.
 
-|Row|GPT-4o<br>solution<br>plausible?|GPT-4o<br>solution<br>resolved issue?|Opus<br>solution<br>plausible?|Opus<br>solution<br>resolved issue?|Count|
+|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Count|
 |---:|--:|--:|--:|--:|--:|
 | 1 | plausible | resolved | n/a | n/a | 73 |
 | 2 | plausible | not resolved | n/a | n/a | 181 |
@@ -173,16 +206,12 @@ at solving these problems,
 because the harness stopped once a plausible solution was found.
 
 The remaining rows consider cases where aider with GPT-4o
-did not find a plausible solution, so Opus had a turn to try and solve.
+did not find a plausible solution, so Opus got a turn to try and solve.
 
 Rows 3-6 are cases where GPT-4o's non-plausible solutions were
 actually found to be correct in hindsight,
-but in rows 4 we can see that aider with Opus overrides
+but in row 4 we can see that aider with Opus overrides
 2 of them with a plausible-but-incorrect solution.
-The original correct solutions from GPT-4o may not have been
-plausible because of pre-existing or otherwise
-unresolved editing, linting or testing errors which were unrelated
-to the SWE Bench issue or which turned out to be non-fatal.
 
 In rows 5-6 & 9-10 we can see that both GPT-4o and Opus
 produced non-plausible solutions,
diff --git a/assets/swe_bench.jpg b/assets/swe_bench.jpg
index 5df496d66..1796f2720 100644
Binary files a/assets/swe_bench.jpg and b/assets/swe_bench.jpg differ
diff --git a/assets/swe_bench.svg b/assets/swe_bench.svg
index d79ba8334..8abdd70a8 100644
--- a/assets/swe_bench.svg
+++ b/assets/swe_bench.svg
@@ -6,7 +6,7 @@
-    2024-05-31T11:28:28.622491
+    2024-05-31T11:41:49.017547
[regenerated SWE Bench results chart: the remaining SVG path and coordinate changes are omitted]
diff --git a/benchmark/swe-bench.txt b/benchmark/swe-bench.txt
index 7d5f34ee2..fee177e32 100644
--- a/benchmark/swe-bench.txt
+++ b/benchmark/swe-bench.txt
@@ -3,5 +3,5 @@
 13.9% Devin|(570)
 13.8% Amazon Q|Developer|Agent|(2294)
 12.5% SWE-|Agent|+ GPT-4|(2294)
-10.6% AutoCode|Rover|(2294)
+10.6% Auto|Code|Rover|(2294)
 10.5% SWE-|Agent|+ Opus|(2294)
diff --git a/benchmark/swe_bench_lite.py b/benchmark/swe_bench_lite.py
index 0488c6bb7..fe9489cb5 100644
--- a/benchmark/swe_bench_lite.py
+++ b/benchmark/swe_bench_lite.py
@@ -76,7 +76,7 @@ def plot_swe_bench_lite(data_file):
     ax.set_title(title, fontsize=20)
     # ax.set_ylim(0, 29.9)
     plt.xticks(
-        fontsize=17,
+        fontsize=16,
         color=font_color,
     )
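The companion change in benchmark/swe-bench.txt, which splits "AutoCode" into "Auto|Code", suggests that `|` marks line breaks in the chart's x-axis labels, so the longer name wraps onto extra lines. Below is a minimal sketch of how such a line could be parsed under that assumption; it is not the actual parsing code in swe_bench_lite.py.

```python
def parse_result_line(line: str):
    """Split '10.6% Auto|Code|Rover|(2294)' into a score and a wrapped label."""
    pct_text, rest = line.split("%", 1)
    score = float(pct_text)                  # bar height, e.g. 10.6
    label = rest.strip().replace("|", "\n")  # '|' assumed to mark label line breaks
    return score, label


print(parse_result_line("10.6% Auto|Code|Rover|(2294)"))
# (10.6, 'Auto\nCode\nRover\n(2294)')
```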