diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index 4e9ffa5df..57e7d9b79 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -23,8 +23,8 @@ that was reported recently.
 
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
 
 Aider was benchmarked on the same
-[random 570](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
-of the 2294 SWE Bench problems that were used in the
+[randomly selected 570](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
+of the 2,294 SWE Bench problems that were used in the
 [Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
 Please see the [references](#references) for more details on the data presented in this chart.
@@ -187,68 +187,20 @@ are "more plausible" than some of GPT-4o's non-plausible solutions.
 These more plausible, incorrect solutions can
 eclipse some of the earlier non-plausible correct solutions that
 GPT-4o generated.
-This reduces GPT-4o's score in the table (15.3%) from the combined GPT-4o & Opus
-benchmark,
-as compared to the results from just one try using aider with GPT-4o (17.0%).
+This is why GPT-4o's score in the table
+showing the combined GPT-4o & Opus results (15.3%)
+is lower than the result from just one try using aider with GPT-4o (17.0%).
 
 For these reasons, adding additional attempts is not guaranteed
 to monotonically increase the number of resolved problems.
 New solutions may resolve some new problems
 but they may also eclipse and discard some of the previous non-plausible correct solutions.
+
 Luckily, additional attempts usually provide a net increase in the overall
 number of resolved solutions.
 This was the case for both this main SWE Bench result and the
 earlier Lite result.
 
-The table below breaks down the benchmark outcome of each problem,
-showing whether aider with GPT-4o and with Opus
-produced plausible and/or correct solutions.
-
-|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|:--:|:--:|:--:|:--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | no | n/a | n/a | 181 | 0 |
-| C | no | no | **plausible** | no | 53 | 0 |
-| D | no | no | **plausible** | **resolved** | 12 | 12 |
-| E | no | **resolved** | **plausible** | no | 2 | 0 |
-| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| G | no | no | no | no | 216 | 0 |
-| H | no | no | no | **resolved** | 4 | 2 |
-| I | no | **resolved** | no | no | 4 | 3 |
-| J | no | **resolved** | no | **resolved** | 17 | 17 |
-| K | no | no | n/a | n/a | 7 | 0 |
-|Total|||||570|108|
-
-Rows A-B show the cases where
-aider with GPT-4o found a plausible solution during the first attempt.
-Of those, 73 went on to be deemed as resolving the issue,
-while 181 were not in fact correct solutions.
-The second attempt with Opus never happened,
-because the harness stopped once a
-plausible solution was found.
-
-Rows C-F consider the straightforward cases where aider with GPT-4o
-didn't find a plausible solution but Opus did.
-So Opus' solutions were adopted and they
-went on to be deemed correct for 13 problems
-and incorrect for 55.
-
-In that group, Row E is an interesting special case, where GPT-4o found 2
-non-plausible but correct solutions.
-We can see that Opus overrides
-them with plausible-but-incorrect
-solutions resulting in 0 resolved problems from that row.
-
-Rows G-K cover the cases where neither model
-produced plausible solutions.
-Which solution was ultimately selected for each problem depends on
-[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
-
-Row K contains cases where Opus returned errors due to context window
-exhaustion or other problems.
-In these cases aider with Opus was unable to produce any solutions
-so GPT-4o's solutions were adopted.
-
 ## Computing the benchmark score
 
 The benchmark harness produced one proposed solution for each of
@@ -289,11 +241,11 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 Below are the references for the SWE-Bench results displayed in the graph at the beginning of this article.
 
-- [13.9% Devin (benchmarked on 570 instances)](https://www.cognition.ai/post/swe-bench-technical-report)
-- [13.8% Amazon Q Developer Agent (benchmarked on 2294 instances)](https://www.swebench.com)
-- [12.5% SWE- Agent + GPT-4 (benchmarked on 2294 instances)](https://www.swebench.com)
-- [10.6% AutoCode Rover (benchmarked on 2294 instances)](https://arxiv.org/pdf/2404.05427v2)
-- [10.5% SWE- Agent + Opus (benchmarked on 2294 instances)](https://www.swebench.com)
+- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
+- [13.8% Amazon Q Developer Agent, benchmarked on 2,294 instances.](https://www.swebench.com)
+- [12.5% SWE-Agent + GPT-4, benchmarked on 2,294 instances.](https://www.swebench.com)
+- [10.6% AutoCodeRover, benchmarked on 2,294 instances.](https://arxiv.org/pdf/2404.05427v2)
+- [10.5% SWE-Agent + Opus, benchmarked on 2,294 instances.](https://www.swebench.com)
 
 The graph contains average pass@1 results for AutoCodeRover. 
 The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
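The attempt-selection behavior the post describes above (stop at the first plausible solution; otherwise fall back to the candidate the harness considers "most plausible") is easy to misread, so here is a minimal Python sketch of it. The `Solution` record, its `plausible` flag, and `pick_solution` are invented names for illustration, not aider's actual harness code, and the last-attempt tie-break is a stand-in for the harness's real "most plausible" ranking:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Solution:
    model: str       # e.g. "gpt-4o" or "opus" (hypothetical labels)
    resolves: bool   # ground truth, unknown to the harness at selection time
    plausible: bool  # whether the solution passed the plausibility checks


def pick_solution(attempts: list[Solution]) -> Optional[Solution]:
    """Keep the first plausible solution; otherwise fall back to a tie-break."""
    for solution in attempts:
        if solution.plausible:
            # Later attempts never run once a plausible solution appears,
            # and earlier non-plausible attempts are discarded.
            return solution
    # Neither attempt was plausible: the real harness picks the candidate it
    # considers "most plausible"; this stand-in just keeps the last attempt.
    return attempts[-1] if attempts else None


# Row E of the outcome table: GPT-4o's non-plausible but actually correct
# solution is eclipsed by Opus' plausible but incorrect one, so the problem
# counts as unresolved even though a correct solution existed.
gpt4o = Solution("gpt-4o", resolves=True, plausible=False)
opus = Solution("opus", resolves=False, plausible=True)
assert pick_solution([gpt4o, opus]) is opus
```

This is exactly why extra attempts are not monotonic: a plausible-but-wrong candidate can displace a non-plausible-but-right one.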
diff --git a/assets/swe_bench.jpg b/assets/swe_bench.jpg
index 175eb7063..9b5029fa8 100644
Binary files a/assets/swe_bench.jpg and b/assets/swe_bench.jpg differ
diff --git a/assets/swe_bench.svg b/assets/swe_bench.svg
index cdafbfae7..149381f98 100644
[Mechanical changes to the regenerated chart SVG omitted: the export timestamp moves from 2024-06-01T14:55:22.797792 to 2024-06-01T16:00:26.751322, the clip-path id changes from p8c34e9879c to pb8819c8324, and the bar and label geometry is redrawn.]
diff --git a/benchmark/swe-bench.txt b/benchmark/swe-bench.txt
index b3e5674b5..338296a3e 100644
--- a/benchmark/swe-bench.txt
+++ b/benchmark/swe-bench.txt
@@ -1,7 +1,7 @@
 18.9% Aider|GPT-4o|& Opus|(570)
 17.0% Aider|GPT-4o|(570)
 13.9% Devin|(570)
-13.8% Amazon Q|Developer|Agent|(2294)
-12.5% SWE-|Agent|+ GPT-4|(2294)
-10.6% Auto|Code|Rover|(2294)
-10.5% SWE-|Agent|+ Opus|(2294)
+13.8% Amazon Q|Developer|Agent|(2,294)
+12.5% SWE-|Agent|+ GPT-4|(2,294)
+10.6% Auto|Code|Rover|(2,294)
+10.5% SWE-|Agent|+ Opus|(2,294)
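For readers inspecting `benchmark/swe-bench.txt` above: each line appears to describe one bar of the chart, with a score, `|`-separated label fragments (each `|` seemingly marking a line break inside the bar's label, which is why "SWE-|Agent" is intentional there while "SWE- Agent" in the prose was a typo), and a trailing instance count in parentheses. Below is a small parser sketch under that assumed format; `parse_bar` and its field names are hypothetical:

```python
import re


def parse_bar(line: str) -> dict:
    """Split one swe-bench.txt line into score, label fragments, and instance count."""
    score_text, rest = line.split("% ", 1)
    fragments = rest.split("|")
    # A trailing fragment like "(2,294)" carries the instance count.
    match = re.fullmatch(r"\((?P<n>[\d,]+)\)", fragments[-1])
    instances = int(match.group("n").replace(",", "")) if match else None
    labels = fragments[:-1] if match else fragments
    return {"score": float(score_text), "labels": labels, "instances": instances}


print(parse_bar("12.5% SWE-|Agent|+ GPT-4|(2,294)"))
# {'score': 12.5, 'labels': ['SWE-', 'Agent', '+ GPT-4'], 'instances': 2294}
```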