Paul Gauthier 2024-05-31 16:59:15 -07:00
parent 83081a5e6f
commit f16e741bcb

@@ -1,11 +1,11 @@
 ---
-title: Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+title: Aider is SOTA for both SWE Bench and SWE Bench Lite
 excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
 highlight_image: /assets/swe_bench.jpg
 draft: true
 ---
 
-# Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+# Aider is SOTA for both SWE Bench and SWE Bench Lite
 
 Aider scored 18.8%
 on the main
@@ -156,38 +156,43 @@ relevant to the acceptance testing.
 The SWE Bench acceptance testing just confirms that tests pass or fail
 in the same pattern as the "gold patch" developed by a human to solve the
 problem.
-Some tests may still fail, and that's ok as long they fail for the gold
+Some tests may fail during acceptance testing,
+and that's ok as long as they failed for the gold
 patch too.
 - There may have been pre-existing linting problems in the repo.
-If they were in code paths that are irrelevant to the problem being solved
-they might not affect acceptance testing.
-Even if aider was unable to resolve the linting errors,
-the solution may still be valid and pass acceptance testing.
-- Aider may have reported file editing errors because it didn't think it was
-able to successfully apply all the edits the LLM specified.
-In this scenario, the LLM must have specified edits in an invalid
-format that doesn't comply with its
-system prompt instructions.
-So it may be that the LLM was somewhat confused and was
-asking for redundant or otherwise
-irrelevant edits.
+If they were in code paths that are irrelevant to the problem being solved,
+then aider's failure to resolve them might not affect acceptance testing.
+- Aider may have reported file editing errors because it thought the LLM
+specified edits that it wasn't able to successfully apply.
+In such a scenario, the LLM must have specified edits in
+a way that doesn't comply with the edit format
+specified in its system prompt.
+Aider tries hard to deal with non-compliant LLM edits,
+but still sometimes fails.
+So the LLM may have become confused and
+asked for redundant or otherwise irrelevant edits.
+Such outstanding edit errors might not be fatal for acceptance testing.
 - Etc.
-Keeping this in mind, we can understand why
-the first row in the table above
-shows GPT-4o accounting for 15.3% of the benchmark score,
-less than the 17.0% result reported earlier in the article
-for just one attempt of aider with GPT-4o.
-When an Opus attempt is allowed, it may propose some *incorrect* solutions which
+Keeping all this in mind, we can understand why
+GPT-4o accounts for 15.3% of the benchmark score in the table above,
+even though we reported earlier that
+just one attempt of aider with GPT-4o scored 17.0%.
+When an Opus attempt is allowed after GPT-4o,
+it may propose some *incorrect* solutions which
 are "more plausible" than some of GPT-4o's non-plausible solutions.
 These more plausible, incorrect solutions can
 eclipse some of
 the earlier non-plausible correct solutions that GPT-4o generated.
 This reduces GPT-4o's score in the table (15.3%) from the combined GPT-4o & Opus
 benchmark,
 as compared to the results from just one try using aider with GPT-4o (17.0%).
-For this reason, adding additional attempts is not guaranteed to monotonically
+For these reasons, adding additional attempts is not guaranteed to monotonically
 increase the number of resolved problems.
-Luckily additional attempts usually provide a net increase in the overall
+The new solutions may solve some new problems but they may also
+eclipse and discard some of the previous non-plausible correct solutions.
+Luckily, additional attempts usually provide a net increase in the overall
 number of resolved solutions.
 This was the case for both this main SWE Bench result and the
 earlier Lite result.
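
The non-monotonic scoring described in the changed paragraphs above can be made concrete with a small sketch. The snippet below is only an illustration of the idea, not aider's actual benchmark harness: the `Attempt` fields, the plausibility rule, and the problem id are assumptions made up for this example. It shows how keeping only the most plausible candidate per problem can discard an earlier non-plausible but correct solution, so an extra Opus attempt can lower GPT-4o's credited share even while usually helping on net.

```python
# Toy illustration (assumed names and fields, not aider's real harness):
# keep the most plausible candidate per problem, then score what was kept.

from dataclasses import dataclass


@dataclass
class Attempt:
    problem: str     # benchmark problem id (hypothetical)
    model: str       # which LLM produced this candidate solution
    plausible: bool  # edits applied cleanly, lints, and repo tests passed
    correct: bool    # would pass acceptance testing (unknown at selection time)


def keep_most_plausible(attempts):
    """For each problem, keep one candidate, preferring plausible over non-plausible."""
    best = {}
    for a in attempts:
        cur = best.get(a.problem)
        if cur is None or (a.plausible and not cur.plausible):
            best[a.problem] = a
    return best


def resolved(best):
    """Count kept candidates that would pass acceptance testing."""
    return sum(a.correct for a in best.values())


# GPT-4o alone: a non-plausible solution that happens to be correct.
gpt4o_only = [Attempt("example-123", "gpt-4o", plausible=False, correct=True)]

# An Opus retry that is plausible but incorrect eclipses the correct solution.
with_opus = gpt4o_only + [Attempt("example-123", "opus", plausible=True, correct=False)]

print(resolved(keep_most_plausible(gpt4o_only)))  # 1 problem resolved
print(resolved(keep_most_plausible(with_opus)))   # 0 resolved for this problem
```

Under these toy assumptions the single GPT-4o attempt resolves the problem but the combined run does not, which mirrors how GPT-4o's 17.0% solo result can shrink to a 15.3% contribution in the combined GPT-4o & Opus benchmark, even though additional attempts usually provide a net increase overall.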