Paul Gauthier 2024-06-01 15:05:29 -07:00
parent 2febc663f3
commit 47a3cb8adf
4 changed files with 135 additions and 105 deletions

@@ -7,7 +7,7 @@ draft: true
# Aider is SOTA for both SWE Bench and SWE Bench Lite
-Aider scored 18.8%
+Aider scored 18.9%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
@@ -135,14 +135,14 @@ aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
-The table also provides details on the 107 solutions that were ultimately
+The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.
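The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration of the logic only; the attempt records and field names are hypothetical, not aider's actual benchmark-harness code.

```python
# Sketch of the solution-selection rule: take the first plausible attempt
# (no outstanding edit/lint/test errors), else fall back to the "most
# plausible" one, i.e. the attempt with the fewest outstanding errors.
# The dicts and keys below are illustrative, not aider's real data model.

def pick_solution(attempts):
    for attempt in attempts:
        if attempt["errors"] == 0:  # plausible: nothing outstanding
            return attempt
    return min(attempts, key=lambda a: a["errors"])

attempts = [
    {"agent": "Aider with GPT-4o", "errors": 2},
    {"agent": "Aider with Opus", "errors": 1},
]
# Neither attempt is plausible, so the one with fewer errors is adopted:
print(pick_solution(attempts)["agent"])  # Aider with Opus
```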
| Attempt | Agent |Number&nbsp;of<br>proposed<br>solutions|Percent&nbsp;of<br>proposed<br>solutions| Number&nbsp;of<br/>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench<br>Lite |
|:--------:|------------|---------:|---------:|----:|---:|--:|
-| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
-| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
-| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
+| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
+| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
+| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
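The percentages in the table follow directly from the counts. A quick back-of-the-envelope check in Python (a standalone sketch, not part of the benchmark harness):

```python
# Verify the arithmetic in the attempts table above.
proposed = {"GPT-4o": 419, "Opus": 151}
resolved = {"GPT-4o": 87, "Opus": 21}
total_benchmarked = 570

assert sum(proposed.values()) == total_benchmarked
assert sum(resolved.values()) == 108

# Percent of correctly resolved solutions, per agent
print(round(100 * resolved["GPT-4o"] / 108, 1))  # 80.6
print(round(100 * resolved["Opus"] / 108, 1))    # 19.4

# Score on SWE Bench Lite: resolved instances over all 570 benchmarked
print(round(100 * 108 / total_benchmarked, 1))   # 18.9
```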
## Non-plausible but correct solutions?
@@ -205,19 +205,19 @@ showing whether aider with GPT-4o and with Opus
produced plausible and/or correct solutions.
|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|--:|--:|--:|--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | not resolved | n/a | n/a | 181 | 0 |
-| C | non-plausible | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| D | non-plausible | **resolved** | **plausible** | not resolved | 2 | 0 |
-| E | non-plausible | not resolved | **plausible** | **resolved** | 12 | 12 |
-| F | non-plausible | not resolved | **plausible** | not resolved | 53 | 0 |
-| G | non-plausible | **resolved** | non-plausible | **resolved** | 16 | 16 |
-| H | non-plausible | **resolved** | non-plausible | not resolved | 5 | 3 |
-| I | non-plausible | not resolved | non-plausible | **resolved** | 4 | 2 |
-| J | non-plausible | not resolved | non-plausible | not resolved | 216 | 0 |
-| K | non-plausible | not resolved | n/a | n/a | 7 | 0 |
-|Total|||||570|107|
+|:--:|:--:|:--:|:--:|:--:|--:|--:|
+| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
+| B | **plausible** | no | n/a | n/a | 181 | 0 |
+| C | no | no | **plausible** | no | 53 | 0 |
+| D | no | no | **plausible** | **resolved** | 12 | 12 |
+| E | no | **resolved** | **plausible** | no | 2 | 0 |
+| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
+| G | no | no | no | no | 216 | 0 |
+| H | no | no | no | **resolved** | 4 | 2 |
+| I | no | **resolved** | no | no | 4 | 3 |
+| J | no | **resolved** | no | **resolved** | 17 | 17 |
+| K | no | no | n/a | n/a | 7 | 0 |
+|Total|||||570|108|
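The per-row counts in the revised table can be cross-checked against the stated totals. A standalone consistency sketch (the row data below is copied from the table, not computed by aider):

```python
# Tally the outcome table above: (problems, resolved) for each row A-K.
rows = {
    "A": (73, 73), "B": (181, 0), "C": (53, 0), "D": (12, 12),
    "E": (2, 0),   "F": (1, 1),   "G": (216, 0), "H": (4, 2),
    "I": (4, 3),   "J": (17, 17), "K": (7, 0),
}
total_problems = sum(n for n, _ in rows.values())
total_resolved = sum(r for _, r in rows.values())
print(total_problems, total_resolved)  # 570 108
```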
Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.
@@ -233,7 +233,7 @@ So Opus' solutions were adopted and they
went on to be deemed correct for 13 problems
and incorrect for 55.
-Row D is an interesting special case, where GPT-4o found 2
+In that group, Row E is an interesting special case, where GPT-4o found 2
non-plausible but correct solutions.
We can see that Opus overrides
them with plausible-but-incorrect
@@ -271,8 +271,8 @@ and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to resolve the problems.
-Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
-or 18.8%.
+Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
+or 18.9%.
## Acknowledgments