Paul Gauthier 2024-05-31 11:23:13 -07:00
parent 38ac9de678
commit 7fe697c1df


@@ -13,6 +13,8 @@ on the main
achieving a state-of-the-art result.
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
This is in addition to
[aider's SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html)
@@ -48,17 +50,17 @@ avoid wasting time and token costs.
## Benchmark methodology
Benchmarking was conducted as follows:

- Aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
- After that aider ran as normal, except all of aider's
suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o isn't plausible, the harness launches aider to try again from scratch,
this time using Claude 3 Opus.
- If no plausible solution is found after those two tries, the harness picks the "most plausible" solution with the fewest edit/lint/test problems (see the sketch below).
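
To make that retry flow concrete, here is a minimal sketch of the harness logic in Python. The helper names (`run_aider`, `is_plausible`, `problem_count`) and the record layout are illustrative assumptions, not aider's actual benchmark code.

```python
# Hypothetical sketch of the two-try harness described above; run_aider,
# is_plausible and problem_count are assumed helpers, not real aider APIs.

def solve_with_retries(problem):
    """Try GPT-4o first, then Claude 3 Opus, keeping the best candidate."""
    attempts = []
    for model in ["gpt-4o", "claude-3-opus"]:
        # Launch aider in the problem's repo with the issue text as the
        # opening chat message, auto-accepting all of aider's suggestions.
        result = run_aider(repo=problem.repo, message=problem.statement, model=model)
        attempts.append(result)
        # "Plausible" here means aider edited the repo without syntax errors
        # and without breaking any pre-existing tests.
        if is_plausible(result):
            return result
    # No plausible solution after two tries: keep the "most plausible"
    # attempt, i.e. the one with the fewest edit/lint/test problems.
    return min(attempts, key=problem_count)
```
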
@@ -71,7 +73,8 @@ correctly resolved.
This is the same methodology
that was used for [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
The only difference is that for this result
at most two tries were attempted instead of six,
due to the increased token costs involved in this benchmark.
The SWE Bench problems are more difficult and involve edits to
more than one source file,
@@ -132,45 +135,83 @@ verified as correctly resolving their issue.
| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
## Non-plausible but correct solutions?

It's worth noting that the first row of the table above
only scored 15.3% on the benchmark,
which differs from the 17.0% result reported above for aider with just GPT-4o.
This is because making additional attempts is not guaranteed to
monotonically increase the number of resolved issues.
Later attempts may propose solutions which
seem "more plausible" than prior attempts,
but which are actually worse solutions.
Luckily the later attempts usually provide a net increase in the overall
number of resolved solutions, as is the case here.

This table breaks down the plausibility of each solution proposed by
aider with GPT-4o and with Opus, as well as whether it was actually
a correct solution.

|Row|GPT-4o<br>solution<br>plausible?|GPT-4o<br>solution<br>resolved issue?|Opus<br>solution<br>plausible?|Opus<br>solution<br>resolved issue?|Count|
|---:|--:|--:|--:|--:|--:|
| 1 | plausible | resolved | n/a | n/a | 73 |
| 2 | plausible | not resolved | n/a | n/a | 181 |
| 3 | non-plausible | resolved | plausible | resolved | 1 |
| 4 | non-plausible | resolved | plausible | not resolved | 2 |
| 5 | non-plausible | resolved | non-plausible | resolved | 16 |
| 6 | non-plausible | resolved | non-plausible | not resolved | 5 |
| 7 | non-plausible | not resolved | plausible | resolved | 12 |
| 8 | non-plausible | not resolved | plausible | not resolved | 53 |
| 9 | non-plausible | not resolved | non-plausible | resolved | 4 |
| 10 | non-plausible | not resolved | non-plausible | not resolved | 216 |
| 11 | non-plausible | not resolved | n/a | n/a | 7 |
Rows 1-2 show the case where the first solution found
by aider with GPT-4o was plausible. Of those, 73 went on to be deemed as resolving the issue,
while 181 were not in fact correct solutions. Opus never got a try
at solving these problems, because the harness stopped once a
plausible solution was found.
The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so Opus had a turn to try and solve.
Rows 3-6 are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight,
but in row 4 we can see that aider with Opus overrode
2 of them with plausible-but-incorrect
solutions.
The original correct solutions from GPT-4o may not have been
plausible because of pre-existing or otherwise
unresolved editing, linting or testing errors which were unrelated
to the SWE Bench issue or which turned out to be non-fatal.
In rows 5-6 & 9-10 we can see that both GPT-4o and Opus
produced non-plausible solutions,
and which one was selected came down to the details of
which solution the harness considered "most plausible".
Row 11 contains cases where Opus returned errors due to context window
exhaustion or other problems.
In these cases aider with Opus was unable to produce any solutions.
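
For readers who want to reproduce this kind of breakdown from their own benchmark logs, here is a rough sketch of the tally. The per-problem record layout is a hypothetical assumption, not the harness's actual data format.

```python
# Hypothetical per-problem records: a GPT-4o attempt, plus an Opus attempt
# only when the GPT-4o solution wasn't plausible and Opus didn't error out.

def categorize(problem):
    """Map one benchmark problem onto a row of the breakdown table above."""
    gpt4o = problem["gpt4o"]
    opus = problem.get("opus")  # None covers rows 1-2 and row 11

    def label(attempt):
        return ("plausible" if attempt["plausible"] else "non-plausible",
                "resolved" if attempt["resolved"] else "not resolved")

    if opus is None:
        return (*label(gpt4o), "n/a", "n/a")
    return (*label(gpt4o), *label(opus))

# Tallying the rows, e.g.:
#   from collections import Counter
#   counts = Counter(categorize(p) for p in problems)
```
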
## Computing the benchmark score

Benchmarking produced one candidate solution for each of
the 570 SWE Bench problems.
A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each candidate solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the candidate solution correctly resolved the issue.
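
As a rough sketch of that comparison, assuming hypothetical `discard_test_edits` and `run_full_test_suite` helpers and a simple test-report equality check (the real SWE Bench evaluation is more involved):

```python
# Illustrative only: discard_test_edits and run_full_test_suite are assumed
# helpers, not part of aider or the official SWE Bench evaluation code.

def resolves_issue(repo, candidate_patch, gold_patch):
    """Accept a candidate only if its test outcomes match the gold patch's."""
    # Drop any edits the candidate made to test files, so the original,
    # unmodified test suite (including held-out acceptance tests) is what runs.
    cleaned = discard_test_edits(candidate_patch)
    candidate_report = run_full_test_suite(repo, cleaned)
    gold_report = run_full_test_suite(repo, gold_patch)
    # The issue counts as resolved only when the candidate produces the same
    # pass/fail outcomes as the human-written "gold" patch.
    return candidate_report == gold_report
```
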
These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to solve the problems.