This commit is contained in:
Paul Gauthier 2024-05-31 11:23:13 -07:00
parent 38ac9de678
commit 7fe697c1df


@@ -13,6 +13,8 @@ on the main
achieving a state-of-the-art result.
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
This is in addition to
[aider's SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html)
@@ -48,17 +50,17 @@ avoid wasting time and token costs.
## Benchmark methodology
Benchmarking was conducted as follows:
- Aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
- After that aider ran as normal, except all of aider's
suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o isn't plausible, the harness launches aider to try again from scratch,
this time using Claude 3 Opus.
- If no plausible solution is found after those two tries, the harness picks the "most plausible" solution with the fewest edit/lint/test problems.
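The retry logic above is small enough to sketch directly. Below is a minimal Python sketch of the harness loop, assuming hypothetical helpers (`run_aider`, `Attempt`) that stand in for launching aider and collecting its results; this is an illustration of the described flow, not aider's actual API.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    model: str
    edited_repo: bool    # did aider report a successful edit?
    lint_errors: int     # syntax/lint problems introduced by the edit
    test_failures: int   # newly broken *pre-existing* tests
    diff: str            # the proposed solution


def is_plausible(a: Attempt) -> bool:
    # "Plausibly correct": aider edited the repo without causing
    # syntax errors or breaking any pre-existing tests.
    return a.edited_repo and a.lint_errors == 0 and a.test_failures == 0


def solve(problem, run_aider) -> Attempt:
    # run_aider(problem, model) is a hypothetical helper that launches aider
    # in the problem's git repository with the problem statement as the opening
    # user message, auto-accepts every suggestion, and returns an Attempt.
    attempts = []
    for model in ("gpt-4o", "claude-3-opus"):   # at most two tries
        attempt = run_aider(problem, model)
        attempts.append(attempt)
        if is_plausible(attempt):
            return attempt                      # stop at the first plausible solution
    # No plausible solution after two tries: keep the "most plausible" attempt,
    # i.e. the one with the fewest edit/lint/test problems.
    return min(attempts, key=lambda a: (not a.edited_repo, a.lint_errors, a.test_failures))
```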
@@ -71,7 +73,8 @@ correctly resolved.
This is the same methodology
that was used for [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
The only difference is that for this result
at most two tries were attempted instead of six,
due to the increased token costs involved in this benchmark.
The SWE Bench problems are more difficult and involve edits to
more than one source file,
@@ -132,45 +135,83 @@ verified as correctly resolving their issue.
| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
If we break down the solutions solely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw problems that GPT-4o had failed to
find a plausible solution for on its first try.
| Agent | Number&nbsp;of<br>proposed<br>solutions | Number&nbsp;of<br>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>proposed<br>which<br>correctly<br>resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 419 | 87 | 20.8% |
| Aider with Opus | 151 | 20 | 13.2% |
| **Total** | **570** | **107** | **18.8%** |

Aider with GPT-4o was producing higher quality proposed solutions,
with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.

## Non-plausible but correct solutions?

It's worth noting that the first row of the table above
only scored 15.3% on the benchmark,
which differs from the 17.0% result reported above for aider with just GPT-4o.
This is because making additional attempts is not guaranteed to
monotonically increase the number of resolved issues.
Later attempts may propose solutions which
seem "more plausible" than prior attempts,
but which are actually worse solutions.
Luckily the later attempts usually provide a net increase in the overall
number of resolved issues, as is the case here.

This table breaks down the plausibility of each solution proposed by
aider with GPT-4o and with Opus, as well as whether it was actually
a correct solution.
|Row|GPT-4o<br>solution<br>plausible?|GPT-4o<br>solution<br>resolved issue?|Opus<br>solution<br>plausible?|Opus<br>solution<br>resolved issue?|Count|
|---:|--:|--:|--:|--:|--:|
| 1 | plausible | resolved | n/a | n/a | 73 |
| 2 | plausible | not resolved | n/a | n/a | 181 |
| 3 | non-plausible | resolved | plausible | resolved | 1 |
| 4 | non-plausible | resolved | plausible | not resolved | 2 |
| 5 | non-plausible | resolved | non-plausible | resolved | 16 |
| 6 | non-plausible | resolved | non-plausible | not resolved | 5 |
| 7 | non-plausible | not resolved | plausible | resolved | 12 |
| 8 | non-plausible | not resolved | plausible | not resolved | 53 |
| 9 | non-plausible | not resolved | non-plausible | resolved | 4 |
| 10 | non-plausible | not resolved | non-plausible | not resolved | 216 |
| 11 | non-plausible | not resolved | n/a | n/a | 7 |
Rows 1-2 show the case where the first solution found
by aider with GPT-4o was plausible. Of those, 73 went on to be judged to resolve the issue,
while 181 were not in fact correct solutions. Opus never got a try
at solving these problems, because the harness stopped once a
plausible solution was found.
The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so aider with Opus got a turn to try and solve them.
Rows 3-6 are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight.
But in row 4 we can see that aider with Opus overrode
2 of them with plausible-but-incorrect
solutions.
The original correct solutions from GPT-4o may not have been
plausible because of pre-existing or otherwise
unresolved editing, linting or testing errors which were unrelated
to the SWE Bench issue or which turned out to be non-fatal.
In rows 5-6 & 9-10 we can see that both GPT-4o and Opus
produced non-plausible solutions.
In those cases, which solution was selected depended on
which one the harness considered "most plausible",
based on which had the fewest edit, lint and test problems.
Row 11 contains cases where Opus returned errors due to context window
exhaustion or other problems.
In these cases aider with Opus was unable to produce any solutions.
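As a rough sanity check, the counts above can be tallied directly. The snippet below is plain Python arithmetic over the table, not part of the benchmark code, and it assumes (as described above) that the harness always kept a plausible solution when one was found.

```python
# Count column from the plausibility table above, keyed by row number.
counts = {1: 73, 2: 181, 3: 1, 4: 2, 5: 16, 6: 5,
          7: 12, 8: 53, 9: 4, 10: 216, 11: 7}

assert sum(counts.values()) == 570  # all SWE Bench problems accounted for

# Rows where the kept solution is guaranteed to resolve the issue:
#   row 1: GPT-4o's solution was plausible and resolved (kept immediately)
#   rows 3, 7: Opus's solution was plausible and resolved (kept after GPT-4o's non-plausible try)
#   row 5: both solutions were non-plausible but both resolved, so either pick works
guaranteed = counts[1] + counts[3] + counts[7] + counts[5]
print(guaranteed)  # 102 -> the remaining 5 of the 107 resolved instances
                   # came from rows 6 and 9, where the "most plausible"
                   # tie-break happened to keep the correct solution
```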
## Computing the benchmark score
Benchmarking produced one candidate solution for each of
the 570 SWE Bench problems.
A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each candidate solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the candidate solution correctly resolved the issue.
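A simplified sketch of this evaluation step is shown below. It uses hypothetical helpers (`apply_patch`, `run_tests`) passed in as parameters rather than the actual SWE Bench evaluation scripts, and only illustrates the comparison described above.

```python
def resolves_issue(problem, candidate_diff, gold_diff, apply_patch, run_tests) -> bool:
    # Hypothetical helpers:
    #   apply_patch(repo, diff, skip_test_files=...) applies a diff to a fresh checkout
    #   run_tests(repo) runs the full suite (including held-out acceptance tests)
    #   and returns a {test_name: passed} mapping.

    # Any edits the candidate made to test files are discarded, so the correct,
    # unmodified test suite is used for acceptance testing.
    candidate_repo = apply_patch(problem.repo, candidate_diff, skip_test_files=True)
    gold_repo = apply_patch(problem.repo, gold_diff)

    candidate_results = run_tests(candidate_repo)
    gold_results = run_tests(gold_repo)

    # The candidate resolves the issue if its test results match the results
    # produced by the human-written "gold" patch.
    return candidate_results == gold_results
```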
These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to solve the problems.