![SWE Bench results](https://aider.chat/assets/swe_bench.svg)

Aider was benchmarked on 570 of the 2294 SWE Bench problems.
These are the same
[randomly selected 570 problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs) that
[Devin used in their evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
Please see the [references](#references)
for more details on the data presented in this chart.

## Interactive, not agentic

Aider achieved this result mainly through its existing features that focus on static
code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.

suggestions were always accepted without user approval.

- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems.
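
To make that flow concrete, here is a minimal Python sketch of the retry logic described in these bullets. All of the names (`Attempt`, `run_aider`, the field names) are hypothetical stand-ins for illustration, not the actual API of the linked aider-swe-bench harness:

```python
from dataclasses import dataclass, field

@dataclass
class Attempt:
    # Hypothetical bookkeeping for one aider run on one problem.
    model: str
    edits_applied: bool = True                            # aider reported applying every edit
    lint_errors: list = field(default_factory=list)       # syntax/lint problems after editing
    test_failures: list = field(default_factory=list)     # newly broken pre-existing tests

def is_plausible(a: Attempt) -> bool:
    # "Plausibly correct": all edits applied, no syntax/lint errors,
    # and no *pre-existing* tests broken.
    return a.edits_applied and not a.lint_errors and not a.test_failures

def solve(problem, run_aider):
    attempts = []
    for model in ("gpt-4o", "claude-3-opus"):   # attempt 1, then a fresh attempt 2
        attempt = run_aider(problem, model)
        if is_plausible(attempt):
            return attempt                      # stop at the first plausible solution
        attempts.append(attempt)
    # Neither try was plausible: keep the "most plausible" attempt,
    # i.e. the one with the fewest edit/lint/test problems.
    return min(attempts, key=lambda a: (not a.edits_applied,
                                        len(a.lint_errors),
                                        len(a.test_failures)))
```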

It's important to be clear that
*aider and the benchmark harness

The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.

This is the same approach
that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
Aider alternated between GPT-4o and Opus for up to 6 total attempts
on the Lite benchmark.
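
As a small sketch of that alternation (a hypothetical helper, assuming the same stop-at-first-plausible rule as above):

```python
# Alternate models for up to 6 total attempts, as on the Lite benchmark.
MODELS = ["gpt-4o", "claude-3-opus"]

def lite_model_schedule(max_attempts: int = 6):
    for i in range(max_attempts):
        yield MODELS[i % len(MODELS)]

print(list(lite_model_schedule()))
# ['gpt-4o', 'claude-3-opus', 'gpt-4o', 'claude-3-opus', 'gpt-4o', 'claude-3-opus']
```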

Due to the increased token costs involved in running

that used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

The benchmark harness ran aider with GPT-4o to try
and solve the problem. If a plausible solution wasn't found,
it ran aider with Opus
to try and solve the problem.

The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.
A proposed solution is either:

- A plausible solution where

verified as correctly resolving their issue.

A solution doesn't have to be plausible in order to correctly resolve the issue.
Recall that plausible is simply defined as aider
reporting that it successfully completed all file edits,
repaired and resolved any linting errors
and resolved any test failures.
But there are many reasons why aider might fail to do those things
and yet still produce a solution that will pass
acceptance testing:

- There may have been pre-existing failing tests in the repo,
before aider even started working on the SWE Bench problem.
Aider may not have resolved such issues, and yet they may turn out not to be
relevant to the acceptance testing.
The SWE Bench acceptance testing just confirms that tests pass or fail
in the same pattern as the "gold patch" developed by a human to solve the
problem (see the sketch after this list).
Some tests may still fail, and that's ok as long as they fail for the gold
patch too.
- There may have been pre-existing linting problems in the repo.
If they were in code paths that are irrelevant to the problem being solved,
they might not affect acceptance testing.
Even if aider was unable to resolve the linting errors,
the solution may still be valid and pass acceptance testing.
- Aider may have reported file editing errors because it didn't think it was
able to successfully apply all the edits the LLM specified.
In this scenario, the LLM must have specified edits in an invalid
format that doesn't comply with its
system prompt instructions.
So it may be that the LLM was somewhat confused and was
asking for redundant or otherwise
irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.
- Etc.
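
Here is a tiny sketch of the pass/fail "pattern" comparison referenced in the first bullet above. The dict-of-booleans shape and the helper name are assumptions made for illustration; real SWE Bench grading is more detailed, but the comparison works in this spirit:

```python
def matches_gold_pattern(candidate: dict, gold: dict) -> bool:
    # Each dict maps test name -> True (passed) / False (failed).
    # A test that also fails under the gold patch doesn't count against
    # the candidate solution; everything else must match the gold run.
    return all(candidate.get(test, False) == passed
               for test, passed in gold.items())

gold      = {"test_fix": True, "test_old_flake": False}   # fails for the gold patch too
candidate = {"test_fix": True, "test_old_flake": False}   # same pattern -> accepted
assert matches_gold_pattern(candidate, gold)
```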

Keeping this in mind, we can understand why
the first row in the table above
shows GPT-4o accounting for 15.3% of the benchmark score,
less than the 17.0% result reported earlier in the article
for just one attempt of aider with GPT-4o.
When an Opus attempt is allowed, it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.
These more plausible, incorrect solutions can
eclipse some of
the earlier non-plausible correct solutions that GPT-4o generated.

For this reason, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
Luckily additional attempts usually provide a net increase in the overall
number of resolved solutions.
This was the case for both this main SWE Bench result and the
earlier Lite result.

The table below breaks down the plausibility of each solution proposed by
aider with GPT-4o and with Opus, and indicates which were actually
correct solutions.

|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Count|
|:--:|--:|--:|--:|--:|--:|
| A | plausible | resolved | n/a | n/a | 73 |
| B | plausible | not resolved | n/a | n/a | 181 |
| C | non-plausible | resolved | plausible | resolved | 1 |
| D | non-plausible | resolved | plausible | not resolved | 2 |
| E | non-plausible | resolved | non-plausible | resolved | 16 |
| F | non-plausible | resolved | non-plausible | not resolved | 5 |
| G | non-plausible | not resolved | non-plausible | resolved | 4 |
| H | non-plausible | not resolved | non-plausible | not resolved | 216 |
| I | non-plausible | not resolved | plausible | resolved | 12 |
| J | non-plausible | not resolved | plausible | not resolved | 53 |
| K | non-plausible | not resolved | n/a | n/a | 7 |
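
As a cross-check, the table can be tallied directly. A small sketch (the row encoding here is mine, derived from the table above):

```python
# Each entry: (GPT-4o solution resolved?, Opus solution resolved?, count).
# None means Opus never ran (rows A, B, K).
rows = {
    "A": (True,  None,  73), "B": (False, None, 181),
    "C": (True,  True,   1), "D": (True,  False,  2),
    "E": (True,  True,  16), "F": (True,  False,  5),
    "G": (False, True,   4), "H": (False, False, 216),
    "I": (False, True,  12), "J": (False, False, 53),
    "K": (False, None,   7),
}

total  = sum(n for _, _, n in rows.values())             # 570 problems
either = sum(n for g, o, n in rows.values() if g or o)   # 113: some attempt resolved it
print(total, either, f"{either / total:.1%}")            # 570 113 19.8%
```

Note that 19.8% is only an upper bound on the final score, since in rows like D the harness selected the plausible-but-incorrect Opus solution over GPT-4o's correct one.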

Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.
Of those, 73 went on to be deemed as resolving the issue,
while 181 were not in fact correct solutions.
The second attempt with Opus never happened,
because the harness stopped once a
plausible solution was found.

The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so Opus got a turn to try and solve.
Rows C-F are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight.
In row D we can see the cases where aider with Opus overrode
2 of them with plausible-but-incorrect
solutions.

In rows E-H we can see that both GPT-4o and Opus
produced non-plausible solutions.
Which one was ultimately selected has to do with the
[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).

Rows I-J consider the simple cases where aider with GPT-4o
didn't find a plausible solution but Opus did.
Of these, Opus' solution went on to be deemed correct for 12 problems
and incorrect for 53.

Row K contains cases where Opus returned errors due to context window
exhaustion or other problems.
In these cases aider with Opus was unable to produce any solutions.