From f16e741bcb8475b3eed93d3523d044f38f713677 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Fri, 31 May 2024 16:59:15 -0700
Subject: [PATCH] copy

---
 _posts/2024-05-31-both-swe-bench.md | 51 ++++++++++++++++-------------
 1 file changed, 28 insertions(+), 23 deletions(-)

diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index c4b7cd34e..809fb6d52 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -1,11 +1,11 @@
 ---
-title: Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+title: Aider is SOTA for both SWE Bench and SWE Bench Lite
 excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
 highlight_image: /assets/swe_bench.jpg
 draft: true
 ---
 
-# Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+# Aider is SOTA for both SWE Bench and SWE Bench Lite
 
 Aider scored 18.8% on the main
@@ -156,38 +156,43 @@ relevant to the acceptance testing.
 The SWE Bench acceptance testing just confirms that tests pass
 or fail in the same pattern as the "gold patch"
 developed by a human to solve the problem.
-Some tests may still fail, and that's ok as long they fail for the gold
+Some tests may fail during acceptance testing,
+and that's ok as long as they failed for the gold
 patch too.
 
 - There may have been pre-existing linting problems in the repo.
-If they were in code paths that are irrelevant to the problem being solved
-they might not affect acceptance testing.
-Even if aider was unable to resolve the linting errors,
-the solution may still be valid and pass acceptance testing.
-- Aider may have reported file editing errors because it didn't think it was
-able to successfully apply all the edits the LLM specified.
-In this scenario, the LLM must have specified edits in an invalid
-format that doesn't comply with its
-system prompt instructions.
-So it may be that the LLM was somewhat confused and was
-asking for redundant or otherwise
-irrelevant edits.
+If they were in code paths that are irrelevant to the problem being solved,
+then aider's failure to resolve them might not affect acceptance testing.
+- Aider may have reported file editing errors because it thought the LLM
+specified edits that it wasn't able to successfully apply.
+In such a scenario, the LLM must have specified edits in
+a way that doesn't comply with the edit format
+specified in its system prompt.
+Aider tries hard to deal with non-compliant LLM edits,
+but still sometimes fails.
+So the LLM may have become confused and
+asked for redundant or otherwise irrelevant edits.
 Such outstanding edit errors might not be fatal for acceptance testing.
 - Etc.
 
-Keeping this in mind, we can understand why
-the first row in the table above
-shows GPT-4o accounting for 15.3% of the benchmark score,
-less than the 17.0% result reported earlier in the article
-for just one attempt of aider with GPT-4o.
-When an Opus attempt is allowed, it may propose some *incorrect* solutions which
+Keeping all this in mind, we can understand why
+GPT-4o accounts for 15.3% of the benchmark score in the table above,
+but we reported that
+just one attempt of aider with GPT-4o scored 17.0%.
+When an Opus attempt is allowed after GPT-4o,
+it may propose some *incorrect* solutions which
 are "more plausible" than some of GPT-4o's non-plausible solutions.
 These more plausible, incorrect solutions can eclipse
 some of the earlier non-plausible correct solutions that GPT-4o generated.
+This is why GPT-4o's score in the table (15.3%), taken from the combined
+GPT-4o & Opus benchmark,
+is lower than the result from just one try using aider with GPT-4o (17.0%).
 
-For this reason, adding additional attempts is not guaranteed to monotonically
+For these reasons, adding additional attempts is not guaranteed to monotonically
 increase the number of resolved problems.
-Luckily additional attempts usually provide a net increase in the overall
+The new solutions may solve some new problems but they may also
+eclipse and discard some of the previous non-plausible correct solutions.
+Luckily, additional attempts usually provide a net increase in the overall
 number of resolved solutions.
 This was the case for both this main SWE Bench result
 and the earlier Lite result.