diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index 4e9ffa5df..57e7d9b79 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -23,8 +23,8 @@ that was reported recently.
[](https://aider.chat/assets/swe_bench.svg)
Aider was benchmarked on the same
-[random 570](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
-of the 2294 SWE Bench problems that were used in the
+[randomly selected 570](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
+of the 2,294 SWE Bench problems that were used in the
[Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
Please see the [references](#references)
for more details on the data presented in this chart.
@@ -187,68 +187,20 @@ are "more plausible" than some of GPT-4o's non-plausible solutions.
These more plausible, incorrect solutions can
eclipse some of
the earlier non-plausible correct solutions that GPT-4o generated.
-This reduces GPT-4o's score in the table (15.3%) from the combined GPT-4o & Opus
-benchmark,
-as compared to the results from just one try using aider with GPT-4o (17.0%).
+This is why GPT-4o's score in the table
+showing the combined GPT-4o & Opus results (15.3%)
+is lower than the result from just one try using aider with GPT-4o (17.0%).
For these reasons, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
New solutions may resolve some new problems but they may also
eclipse and discard some of the previous non-plausible correct solutions.
+
Luckily, additional attempts usually provide a net increase in the overall
number of resolved solutions.
This was the case for both this main SWE Bench result and the
earlier Lite result.
-The table below breaks down the benchmark outcome of each problem,
-showing whether aider with GPT-4o and with Opus
-produced plausible and/or correct solutions.
-
-|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|:--:|:--:|:--:|:--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | no | n/a | n/a | 181 | 0 |
-| C | no | no | **plausible** | no | 53 | 0 |
-| D | no | no | **plausible** | **resolved** | 12 | 12 |
-| E | no | **resolved** | **plausible** | no | 2 | 0 |
-| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| G | no | no | no | no | 216 | 0 |
-| H | no | no | no | **resolved** | 4 | 2 |
-| I | no | **resolved** | no | no | 4 | 3 |
-| J | no | **resolved** | no | **resolved** | 17 | 17 |
-| K | no | no | n/a | n/a | 7 | 0 |
-|Total|||||570|108|
-
-Rows A-B show the cases where
-aider with GPT-4o found a plausible solution during the first attempt.
-Of those, 73 went on to be deemed as resolving the issue,
-while 181 were not in fact correct solutions.
-The second attempt with Opus never happened,
-because the harness stopped once a
-plausible solution was found.
-
-Rows C-F consider the straightforward cases where aider with GPT-4o
-didn't find a plausible solution but Opus did.
-So Opus' solutions were adopted and they
-went on to be deemed correct for 13 problems
-and incorrect for 55.
-
-In that group, Row E is an interesting special case, where GPT-4o found 2
-non-plausible but correct solutions.
-We can see that Opus overrides
-them with plausible-but-incorrect
-solutions resulting in 0 resolved problems from that row.
-
-Rows G-K cover the cases where neither model
-produced plausible solutions.
-Which solution was ultimately selected for each problem depends on
-[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
-
-Row K contains cases where Opus returned errors due to context window
-exhaustion or other problems.
-In these cases aider with Opus was unable to produce any solutions
-so GPT-4o's solutions were adopted.
-
## Computing the benchmark score
The benchmark harness produced one proposed solution for each of
@@ -289,11 +241,11 @@ making it faster, easier, and more reliable to run the acceptance tests.
Below are the references for the SWE-Bench results
displayed in the graph at the beginning of this article.
-- [13.9% Devin (benchmarked on 570 instances)](https://www.cognition.ai/post/swe-bench-technical-report)
-- [13.8% Amazon Q Developer Agent (benchmarked on 2294 instances)](https://www.swebench.com)
-- [12.5% SWE- Agent + GPT-4 (benchmarked on 2294 instances)](https://www.swebench.com)
-- [10.6% AutoCode Rover (benchmarked on 2294 instances)](https://arxiv.org/pdf/2404.05427v2)
-- [10.5% SWE- Agent + Opus (benchmarked on 2294 instances)](https://www.swebench.com)
+- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
+- [13.8% Amazon Q Developer Agent, benchmarked on 2,294 instances.](https://www.swebench.com)
+- [12.5% SWE-Agent + GPT-4, benchmarked on 2,294 instances.](https://www.swebench.com)
+- [10.6% AutoCodeRover, benchmarked on 2,294 instances.](https://arxiv.org/pdf/2404.05427v2)
+- [10.5% SWE-Agent + Opus, benchmarked on 2,294 instances.](https://www.swebench.com)
The graph contains average pass@1 results for AutoCodeRover.
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
diff --git a/assets/swe_bench.jpg b/assets/swe_bench.jpg
index 175eb7063..9b5029fa8 100644
Binary files a/assets/swe_bench.jpg and b/assets/swe_bench.jpg differ
diff --git a/assets/swe_bench.svg b/assets/swe_bench.svg
index cdafbfae7..149381f98 100644
--- a/assets/swe_bench.svg
+++ b/assets/swe_bench.svg
diff --git a/benchmark/swe-bench.txt b/benchmark/swe-bench.txt
index b3e5674b5..338296a3e 100644
--- a/benchmark/swe-bench.txt
+++ b/benchmark/swe-bench.txt
@@ -1,7 +1,7 @@
18.9% Aider|GPT-4o|& Opus|(570)
17.0% Aider|GPT-4o|(570)
13.9% Devin|(570)
-13.8% Amazon Q|Developer|Agent|(2294)
-12.5% SWE-|Agent|+ GPT-4|(2294)
-10.6% Auto|Code|Rover|(2294)
-10.5% SWE-|Agent|+ Opus|(2294)
+13.8% Amazon Q|Developer|Agent|(2,294)
+12.5% SWE-|Agent|+ GPT-4|(2,294)
+10.6% Auto|Code|Rover|(2,294)
+10.5% SWE-|Agent|+ Opus|(2,294)