Paul Gauthier 2024-05-31 16:59:15 -07:00
parent 83081a5e6f
commit f16e741bcb

@@ -1,11 +1,11 @@
 ---
-title: Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+title: Aider is SOTA for both SWE Bench and SWE Bench Lite
 excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
 highlight_image: /assets/swe_bench.jpg
 draft: true
 ---
-# Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+# Aider is SOTA for both SWE Bench and SWE Bench Lite
 Aider scored 18.8%
 on the main
@@ -156,38 +156,43 @@ relevant to the acceptance testing.
 The SWE Bench acceptance testing just confirms that tests pass or fail
 in the same pattern as the "gold patch" developed by a human to solve the
 problem.
-Some tests may still fail, and that's ok as long they fail for the gold
+Some tests may fail during acceptance testing,
+and that's ok as long they failed for the gold
 patch too.
 - There may have been pre-existing linting problems in the repo.
-If they were in code paths that are irrelevant to the problem being solved
-they might not affect acceptance testing.
-Even if aider was unable to resolve the linting errors,
-the solution may still be valid and pass acceptance testing.
-- Aider may have reported file editing errors because it didn't think it was
-able to successfully apply all the edits the LLM specified.
-In this scenario, the LLM must have specified edits in an invalid
-format that doesn't comply with its
-system prompt instructions.
-So it may be that the LLM was somewhat confused and was
-asking for redundant or otherwise
-irrelevant edits.
+If they were in code paths that are irrelevant to the problem being solved,
+then aider's failure to resolve them might not affect acceptance testing.
+- Aider may have reported file editing errors because it thought the LLM
+specified edits that it wasn't able to successfully apply.
+In such a scenario, the LLM must have specified edits in
+a way that doesn't comply with the edit format
+specified in its system prompt.
+Aider tries hard to deal with non-compliant LLM edits,
+but still sometimes fails.
+So the LLM may have become confused and
+asked for redundant or otherwise irrelevant edits.
 Such outstanding edit errors might not be fatal for acceptance testing.
 - Etc.
-Keeping this in mind, we can understand why
-the first row in the table above
-shows GPT-4o accounting for 15.3% of the benchmark score,
-less than the 17.0% result reported earlier in the article
-for just one attempt of aider with GPT-4o.
-When an Opus attempt is allowed, it may propose some *incorrect* solutions which
+Keeping all this in mind, we can understand why
+GPT-4o accounts for 15.3% of the benchmark score in the table above,
+but we reported that
+just one attempt of aider with GPT-4o scored 17.0%.
+When an Opus attempt is allowed after GPT-4o,
+it may propose some *incorrect* solutions which
 are "more plausible" than some of GPT-4o's non-plausible solutions.
 These more plausible, incorrect solutions can
 eclipse some of
 the earlier non-plausible correct solutions that GPT-4o generated.
+This reduces GPT-4o's score in the table (15.3%) from the combined GPT-4o & Opus
+benchmark,
+as compared to the results from just one try using aider with GPT-4o (17.0%).
-For this reason, adding additional attempts is not guaranteed to monotonically
+For these reasons, adding additional attempts is not guaranteed to monotonically
 increase the number of resolved problems.
-Luckily additional attempts usually provide a net increase in the overall
+The new solutions may solve some new problems but they may also
+eclipse and discard some of the previous non-plausible correct solutions.
+Luckily, additional attempts usually provide a net increase in the overall
 number of resolved solutions.
 This was the case for both this main SWE Bench result and the
 earlier Lite result.