mirror of https://github.com/Aider-AI/aider.git
synced 2025-06-01 18:25:00 +00:00
This commit is contained in:
parent 83081a5e6f
commit f16e741bcb
1 changed file with 28 additions and 23 deletions
@@ -1,11 +1,11 @@
 ---
-title: Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+title: Aider is SOTA for both SWE Bench and SWE Bench Lite
 excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
 highlight_image: /assets/swe_bench.jpg
 draft: true
 ---
 
-# Aider is SOTA for both the main SWE Bench and SWE Bench Lite
+# Aider is SOTA for both SWE Bench and SWE Bench Lite
 
 Aider scored 18.8%
 on the main
@@ -156,38 +156,43 @@ relevant to the acceptance testing.
 The SWE Bench acceptance testing just confirms that tests pass or fail
 in the same pattern as the "gold patch" developed by a human to solve the
 problem.
-Some tests may still fail, and that's ok as long they fail for the gold
+Some tests may fail during acceptance testing,
+and that's ok as long as they failed for the gold
 patch too.
 - There may have been pre-existing linting problems in the repo.
-If they were in code paths that are irrelevant to the problem being solved
-they might not affect acceptance testing.
-Even if aider was unable to resolve the linting errors,
-the solution may still be valid and pass acceptance testing.
-- Aider may have reported file editing errors because it didn't think it was
-able to successfully apply all the edits the LLM specified.
-In this scenario, the LLM must have specified edits in an invalid
-format that doesn't comply with its
-system prompt instructions.
-So it may be that the LLM was somewhat confused and was
-asking for redundant or otherwise
-irrelevant edits.
+If they were in code paths that are irrelevant to the problem being solved,
+then aider's failure to resolve them might not affect acceptance testing.
+- Aider may have reported file editing errors because it thought the LLM
+specified edits that it wasn't able to successfully apply.
+In such a scenario, the LLM must have specified edits in
+a way that doesn't comply with the edit format
+specified in its system prompt.
+Aider tries hard to deal with non-compliant LLM edits,
+but still sometimes fails.
+So the LLM may have become confused and
+asked for redundant or otherwise irrelevant edits.
 Such outstanding edit errors might not be fatal for acceptance testing.
 - Etc.
 
-Keeping this in mind, we can understand why
-the first row in the table above
-shows GPT-4o accounting for 15.3% of the benchmark score,
-less than the 17.0% result reported earlier in the article
-for just one attempt of aider with GPT-4o.
-When an Opus attempt is allowed, it may propose some *incorrect* solutions which
+Keeping all this in mind, we can understand why
+GPT-4o accounts for 15.3% of the benchmark score in the table above,
+but we reported that
+just one attempt of aider with GPT-4o scored 17.0%.
+When an Opus attempt is allowed after GPT-4o,
+it may propose some *incorrect* solutions which
 are "more plausible" than some of GPT-4o's non-plausible solutions.
 These more plausible, incorrect solutions can
 eclipse some of
 the earlier non-plausible correct solutions that GPT-4o generated.
+This reduces GPT-4o's score in the table (15.3%) from the combined GPT-4o & Opus
+benchmark,
+as compared to the results from just one try using aider with GPT-4o (17.0%).
 
-For this reason, adding additional attempts is not guaranteed to monotonically
+For these reasons, adding additional attempts is not guaranteed to monotonically
 increase the number of resolved problems.
-Luckily additional attempts usually provide a net increase in the overall
+The new solutions may solve some new problems but they may also
+eclipse and discard some of the previous non-plausible correct solutions.
+Luckily, additional attempts usually provide a net increase in the overall
 number of resolved solutions.
 This was the case for both this main SWE Bench result and the
 earlier Lite result.
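The "eclipsing" behavior the diff describes can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual benchmark harness: it assumes the scoring pipeline keeps a single candidate solution per problem and prefers any "plausible" attempt over a non-plausible one; the `pick_solution` helper and the dict fields are invented for this example.

```python
# Hypothetical sketch of the eclipsing effect: assume the harness keeps one
# candidate per problem, preferring the first attempt judged "plausible".
def pick_solution(attempts):
    """Return the attempt that would be submitted for acceptance testing."""
    for attempt in attempts:
        if attempt["plausible"]:
            return attempt
    # No plausible solution: fall back to the first attempt.
    return attempts[0]

# GPT-4o alone: its solution looked non-plausible but was actually correct.
gpt4o_only = [{"model": "gpt-4o", "plausible": False, "correct": True}]
print(pick_solution(gpt4o_only)["correct"])  # True -> problem resolved

# A later plausible-but-incorrect Opus attempt eclipses the correct one.
with_opus = gpt4o_only + [{"model": "opus", "plausible": True, "correct": False}]
print(pick_solution(with_opus)["correct"])  # False -> problem now unresolved
```

Under this selection rule, adding attempts can lower a per-model score even while it raises the combined total, which matches the 15.3% vs 17.0% discrepancy discussed above.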