[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
Aider was benchmarked on 570 of the 2294 SWE Bench problems.
These are the same
[randomly selected 570 problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs) that
[Devin used in their evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
Please see the [references](#references)
for more details on the data presented in this chart.
## Interactive, not agentic
Aider achieved this result mainly through its existing features that focus on static
code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.
- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems, as sketched below.
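To make this retry flow concrete, here is a minimal sketch of how such a harness could be structured. It is not the actual aider-swe-bench code: the `Attempt` fields, the `run_aider` callable and the way errors are counted are illustrative assumptions based only on the description above.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Outcome of one aider run (field names are assumed, not the real harness API)."""
    model: str
    diff: str            # the proposed changes to the repo
    edit_errors: int     # edits aider could not apply cleanly
    lint_errors: int     # lint problems still outstanding after repair attempts
    test_failures: int   # pre-existing tests still failing after repair attempts


def is_plausible(attempt: Attempt) -> bool:
    # "Plausibly correct": aider reported clean edits, no lint errors
    # and no failing pre-existing tests.
    return (attempt.edit_errors == 0
            and attempt.lint_errors == 0
            and attempt.test_failures == 0)


def solve(problem, run_aider) -> Attempt:
    """run_aider(problem, model) stands in for launching aider on the repo."""
    # First attempt: aider with GPT-4o.
    first = run_aider(problem, "gpt-4o")
    if is_plausible(first):
        return first

    # Not plausible, so try again from scratch with Claude 3 Opus.
    second = run_aider(problem, "claude-3-opus")
    if is_plausible(second):
        return second

    # Neither attempt was plausible: submit the "most plausible" one,
    # i.e. the one with the fewest outstanding edit/lint/test problems.
    return min((first, second),
               key=lambda a: a.edit_errors + a.lint_errors + a.test_failures)
```
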
It's important to be clear that
*aider and the benchmark harness only had access to the pre-existing tests in each problem's repo*.
The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.
This is the same approach
that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
Aider alternated between GPT-4o and Opus for up to 6 total attempts
on the Lite benchmark.
Due to the increased token costs involved in running this larger benchmark,
aider was limited to two attempts here: one with GPT-4o and one with Opus.
## Aider with GPT-4o & Opus
The benchmark harness ran aider with GPT-4o to try
and solve the problem. If a plausible solution wasn't found,
it ran aider with Opus
to try and solve the problem.
The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.
A proposed solution is either:
- A plausible solution where aider reported no outstanding edit, lint or test errors.
- Or, if neither attempt was plausible, the "most plausible" solution with the fewest outstanding edit/lint/test problems.
A solution doesn't have to be plausible in order to correctly resolve the issue.
Recall that plausible is simply defined as aider
reporting that it successfully completed all file edits,
repaired and resolved any linting errors
and resolved any test failures.
But there are many reasons why aider might fail to do those things
and yet still produce a solution that will pass
acceptance testing:
- There may have been pre-existing failing tests in the repo,
before aider even started working on the SWE Bench problem.
Aider may not have resolved such issues, and yet they may turn out not to be
relevant to the acceptance testing.
The SWE Bench acceptance testing just confirms that tests pass or fail
in the same pattern as the "gold patch" developed by a human to solve the
problem.
Some tests may still fail, and that's ok as long as they fail for the gold
patch too (see the sketch after this list).
- There may have been pre-existing linting problems in the repo.
If they were in code paths that are irrelevant to the problem being solved
they might not affect acceptance testing.
Even if aider was unable to resolve the linting errors,
the solution may still be valid and pass acceptance testing.
- Aider may have reported file editing errors because it didn't think it was
able to successfully apply all the edits the LLM specified.
In this scenario, the LLM must have specified edits in an invalid
format that doesn't comply with its
system prompt instructions.
So it may be that the LLM was somewhat confused and was
asking for redundant or otherwise
irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.
- Etc.
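The "same pattern as the gold patch" idea in the list above can be pictured with a small sketch. This is not SWE Bench's actual evaluation code; the dictionary shape and test names are assumptions, used only to show that a candidate solution can be accepted even when some tests fail, as long as they fail for the gold patch too.

```python
def matches_gold_pattern(candidate: dict[str, bool], gold: dict[str, bool]) -> bool:
    """Accept the candidate patch when every acceptance test passes or fails
    in the same pattern as it does with the human-written gold patch.
    Keys are test names, values are True for pass / False for fail (assumed shape)."""
    return all(candidate.get(test, False) == passed
               for test, passed in gold.items())


# A test that fails even with the gold patch (e.g. a pre-existing failure)
# does not count against a candidate that fails it in the same way.
gold = {"test_fixes_issue": True, "test_unrelated_preexisting": False}
candidate = {"test_fixes_issue": True, "test_unrelated_preexisting": False}
assert matches_gold_pattern(candidate, gold)
```
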
Keeping this in mind, we can understand why
the first row in the table above
shows GPT-4o accounting for 15.3% of the benchmark score,
less than the 17.0% result reported earlier in the article
for just one attempt of aider with GPT-4o.
When an Opus attempt is allowed, it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.
These more plausible, incorrect solutions can
eclipse some of
the earlier non-plausible correct solutions that GPT-4o generated.
For this reason, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
Luckily additional attempts usually provide a net increase in the overall
number of resolved solutions.
This was the case for both this main SWE Bench result and the
earlier Lite result.
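As a toy illustration of this eclipsing effect, reusing the `Attempt` and `is_plausible` names from the earlier sketch (the error counts here are made up), consider a problem where only the GPT-4o attempt would actually pass acceptance testing:

```python
# GPT-4o's attempt would resolve the issue, but a leftover lint problem
# makes it non-plausible, so the harness moves on to Opus.
gpt4o_try = Attempt(model="gpt-4o", diff="...",
                    edit_errors=0, lint_errors=1, test_failures=0)

# Opus' attempt looks clean, so it is plausible, even though its edits
# would not actually resolve the issue.
opus_try = Attempt(model="claude-3-opus", diff="...",
                   edit_errors=0, lint_errors=0, test_failures=0)

# The harness stops at the first plausible solution, so the incorrect
# Opus attempt is submitted and the correct GPT-4o attempt is discarded,
# as in the two problems counted in row D of the table below.
chosen = opus_try if is_plausible(opus_try) else gpt4o_try
assert chosen is opus_try
```
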
The table below breaks down the plausibility of each solution proposed by
aider with GPT-4o and with Opus, and indicates which were actually
correct solutions.

|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Count|
|:--:|--:|--:|--:|--:|--:|
| A | plausible | resolved | n/a | n/a | 73 |
| B | plausible | not resolved | n/a | n/a | 181 |
| C | non-plausible | resolved | plausible | resolved | 1 |
| D | non-plausible | resolved | plausible | not resolved | 2 |
| E | non-plausible | resolved | non-plausible | resolved | 16 |
| F | non-plausible | resolved | non-plausible | not resolved | 5 |
| G | non-plausible | not resolved | non-plausible | resolved | 4 |
| H | non-plausible | not resolved | non-plausible | not resolved | 216 |
| I | non-plausible | not resolved | plausible | resolved | 12 |
| J | non-plausible | not resolved | plausible | not resolved | 53 |
| K | non-plausible | not resolved | n/a | n/a | 7 |

Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.
Of those, 73 went on to be deemed as resolving the issue,
while 181 were not in fact correct solutions.
The second attempt with Opus never happened,
because the harness stopped once a
plausible solution was found.
The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so Opus got a turn to try and solve the problem.
Rows C-F are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight.
In row D we can see that aider with Opus overrode
2 of them with plausible-but-incorrect
solutions.
In rows E-H we can see that both GPT-4o and Opus
produced non-plausible solutions.
Which one was ultimately selected has to do with the
[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
Rows I-J consider the simple cases where aider with GPT-4o
didn't find a plausible solution but Opus did.
Of these, Opus' solution went on to be deemed correct for 12 problems
and incorrect for 53.
Row K contains cases where Opus returned errors due to context window
exhaustion or other problems.
In these cases aider with Opus was unable to produce any solutions.