This is the same approach
that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
For the Lite benchmark,
aider alternated between GPT-4o and Opus for up to 6 total attempts.
Due to the increased token costs involved in running
the main SWE Bench benchmark, aider was limited to 2 total attempts:
one attempt of aider with GPT-4o and one with Opus.

The problems from the main SWE Bench dataset
are more difficult and involved edits to
multiple source files,
which increased the token costs as compared to Lite.
Further, aider was benchmarked on 570 SWE Bench problems
versus only 300 Lite problems,
adding another factor of ~two to the costs.

For a detailed discussion of the benchmark
methodology, please see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
Also, the
[aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
contains the harness and statistics code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

## Aider with GPT-4o & Opus

The benchmark harness started by using aider with GPT-4o to try
and solve each problem.
For problems where this didn't produce a plausible solution,
the harness tried again using aider with Opus.
So at most, two attempts were made for each problem.
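
To make that flow concrete, here is a minimal sketch of the two-attempt loop. The names below (`run_aider`, `is_plausible`, and so on) are hypothetical stand-ins for illustration, not the actual interfaces of the aider-swe-bench harness.

```python
from typing import Callable, Optional

Candidate = dict  # a proposed solution: diff, lint results, test output, etc.

def solve_one_problem(
    problem_id: str,
    run_aider: Callable[[str, str], Candidate],   # (problem_id, model) -> candidate
    is_plausible: Callable[[Candidate], bool],    # clean edits, lint OK, tests pass, ...
) -> Optional[Candidate]:
    # Attempt 1: aider driven by GPT-4o.
    first = run_aider(problem_id, "gpt-4o")
    if is_plausible(first):
        return first

    # Attempt 2, only if needed: aider driven by Claude 3 Opus.
    second = run_aider(problem_id, "claude-3-opus")
    if is_plausible(second):
        return second

    # At most two attempts were made. If neither looks plausible, the harness
    # still submits one candidate; here we simply prefer the later attempt,
    # whereas the real harness ranks candidates by how "plausible" they look.
    return second or first
```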

The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.

Some tests may fail during acceptance testing,
and that's ok as long as they failed for the gold
patch too.
- There may have been pre-existing linting problems in the repo.
If lingering linting issues affected code paths that are not well tested,
they may not impact acceptance testing.
- Aider may have reported file editing errors because it thought the LLM
specified edits that it wasn't able to successfully apply.
This can only happen when the LLM specified edits in
a way that doesn't comply with the editing instructions in the system prompt.
Given that the LLM isn't complying with the system prompt,
it may have become confused and
asked for redundant or otherwise irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.
- Etc.

Keeping all this in mind, we can understand why
GPT-4o accounts for 15.3% of the benchmark score in the table above,
but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
When an Opus attempt is allowed after GPT-4o,
it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.

For these reasons, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
New solutions may solve some new problems but they may also
eclipse and discard some of the previous non-plausible correct solutions.
Luckily, additional attempts usually provide a net increase in the overall
number of resolved problems.
This was the case for both this main SWE Bench result and the
earlier Lite result.
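
As a toy illustration of this eclipsing effect (the plausibility scores below are invented and `Candidate` is a hypothetical stand-in; the harness's real criteria are described in the Lite article linked above):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    attempt: str          # which model produced the solution
    plausibility: float   # how plausible the harness thinks it looks
    resolves_issue: bool  # known only later, during acceptance testing

gpt4o_try = Candidate("aider w/GPT-4o", plausibility=0.4, resolves_issue=True)
opus_try = Candidate("aider w/Opus", plausibility=0.7, resolves_issue=False)

# The harness must submit one solution per problem, and it can only rank
# candidates by plausibility -- it never sees the acceptance-test outcome.
chosen = max([gpt4o_try, opus_try], key=lambda c: c.plausibility)

print(chosen.attempt, chosen.resolves_issue)
# aider w/Opus False -- the extra attempt eclipsed a correct solution.
```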

The table below breaks down the benchmark outcome of each problem,
showing whether aider with GPT-4o and with Opus
produced plausible and/or correct solutions.

|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|
|:--:|--:|--:|--:|--:|--:|
| A | plausible | resolved | n/a | n/a | 73 |
| B | plausible | not resolved | n/a | n/a | 181 |
| I | non-plausible | not resolved | plausible | resolved | 12 |
| J | non-plausible | not resolved | plausible | not resolved | 53 |
| K | non-plausible | not resolved | n/a | n/a | 7 |
|Total|||||570|

Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.

The remaining rows consider cases where aider with GPT-4o
did not find a plausible solution, so Opus got a turn to try and solve.
Rows C-F are cases where GPT-4o's non-plausible solutions were
actually found to be correct in hindsight.
In row D we can see the cases where aider with Opus
definitely overrides
2 of them with plausible-but-incorrect
solutions.

In rows E-H we can see that both GPT-4o and Opus
produced non-plausible solutions.
Which one was ultimately selected for each problem depends on
[details about which solution the harness considered "most plausible"](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).

Rows I-J consider the straightforward cases where aider with GPT-4o
didn't find a plausible solution but Opus did.
Of these, Opus' solution went on to be deemed correct for 12 problems
and incorrect for 53.

## Computing the benchmark score

Benchmarking produced one proposed solution for each of
the 570 SWE Bench problems.

A separate evaluation script was used to
run acceptance testing on each proposed solution.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each proposed solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the proposed solution correctly resolved the issue.
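
A minimal sketch of that acceptance check, with hypothetical helper callables standing in for the standard SWE Bench evaluation machinery:

```python
from typing import Callable, Mapping

def issue_resolved(
    candidate_diff: str,
    gold_test_results: Mapping[str, bool],                   # {test_name: passed} under the gold patch
    strip_test_edits: Callable[[str], str],                   # drop hunks that touch test files
    apply_and_run_tests: Callable[[str], Mapping[str, bool]], # apply diff, run the repo's test suite
) -> bool:
    # Discard any edits aider made to tests, so the correct,
    # unmodified test suite is what actually gets run.
    source_only_diff = strip_test_edits(candidate_diff)

    # Apply the remaining source edits and run the tests.
    candidate_results = apply_and_run_tests(source_only_diff)

    # The issue counts as resolved only if the candidate's test outcomes
    # match the outcomes produced by the human-written "gold" patch.
    return dict(candidate_results) == dict(gold_test_results)
```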

These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the benchmark results.

Table 2 of their
[paper](https://arxiv.org/pdf/2404.05427v2)
reports an `ACR-avg` result of 10.59%, which is an average pass@1 result.

The results presented here for aider are all pass@1, as
the [official SWE Bench Lite leaderboard](https://www.swebench.com)
only accepts pass@1 results.