This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found for each problem.
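
In outline, the attempt loop looks something like the sketch below; `run_aider` and `is_plausible` are hypothetical helpers standing in for the real harness and aider's plausibility checks, not aider's actual API.

```python
# A minimal sketch of the alternating attempt loop, under the assumptions above.
MODELS = ["gpt-4o", "opus"]  # fixed order: GPT-4o always takes the first turn
MAX_ATTEMPTS = 6

def run_aider(instance: dict, model: str) -> str:
    """Hypothetical stand-in: run aider on one SWE Bench instance, return a diff."""
    raise NotImplementedError

def is_plausible(patch: str) -> bool:
    """Hypothetical stand-in: edits applied, lint passes, pre-existing tests pass."""
    raise NotImplementedError

def solve(instance: dict) -> str | None:
    for attempt in range(MAX_ATTEMPTS):
        model = MODELS[attempt % 2]  # attempts 1, 3, 5 use GPT-4o; 2, 4, 6 use Opus
        patch = run_aider(instance, model)
        if is_plausible(patch):
            return patch  # stop at the first plausible solution for this problem
    return None  # no plausible solution after six attempts
```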

The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- On its first attempt, aider with GPT-4o immediately found 69% of all plausible solutions, which accounted for 77% of the correctly resolved problems.
- ~85% of all plausible and ~90% of all resolved solutions were found after one attempt each from aider with GPT-4o and Opus, as the cross-check below the table shows.
- A long tail of solutions continued to be found by both models, including one resolved solution found on the sixth and final attempt at its problem.

| Attempt | Agent | Number<br>plausible<br>solutions | Percent of<br>plausible<br>solutions | Number<br>correctly<br>resolved | Percent of<br>correctly<br>resolved |
|:-------:|-------|---------:|---------:|---------:|---------:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% |
| **Total** | | **300** | **100%** | **79** | **100%** |
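
As a quick cross-check, the percentages in the bullets above follow directly from the counts in this table:

```python
# Per-attempt counts from the table: 300 instances, 79 ultimately resolved.
plausible = [208, 49, 20, 9, 11, 3]
resolved = [61, 10, 3, 2, 2, 1]

print(plausible[0] / 300)        # 0.693 -> 69% of plausible found on attempt 1
print(resolved[0] / 79)          # 0.772 -> 77% of resolved found on attempt 1
print(sum(plausible[:2]) / 300)  # 0.857 -> ~85% of plausible after attempts 1 and 2
print(sum(resolved[:2]) / 79)    # 0.899 -> ~90% of resolved after attempts 1 and 2
```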

If we break down correct solutions purely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn at solving and therefore got to solve all the "easiest" problems.
Aider with Opus only ever saw the problems that GPT-4o failed to solve on the first attempt.

Aider with GPT-4o was producing higher quality plausible solutions,
with a greater chance of going on to be accepted as resolving the issue.
Other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this endeavor.

| Agent | Number<br>plausible<br>solutions | Number<br>correctly<br>resolved | Percent of plausible<br>which resolved |
|-------|---------:|---------:|---------:|
| Aider with GPT-4o | 239 | 66 | 27.6% |
| Aider with Opus | 61 | 13 | 21.3% |
| **Total** | **300** | **79** | **26.3%** |

## Repository map, not RAG

Please add app.py to the chat so I can proceed with the changes.

This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench tasks.
Aider successfully identified the correct file to edit
in 70.3% of the benchmark tasks.

We can determine which file needed to be edited using the "gold" patch
which is associated with each SWE Bench task.
This patch was created by a human developer
to solve the issue, and therefore reveals a file which can
be edited to solve the problem.
Of course aider is not able to see or use the gold patch
or the file names it contains in any way.
This information was only used to compute
statistics after the benchmarking was completed.
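
For illustration, the 70.3% figure can be computed along these lines; a minimal sketch assuming each task record (hypothetically) carries the gold `patch` text and the list of files aider edited, not the actual analysis code:

```python
import re

def files_in_patch(patch_text: str) -> set[str]:
    """Collect the file paths named in a unified diff's '+++ b/...' headers."""
    return {m.group(1) for m in re.finditer(r"^\+\+\+ b/(\S+)", patch_text, re.MULTILINE)}

def gold_file_hit_rate(tasks: list[dict]) -> float:
    """Fraction of tasks where aider edited a file named in the gold patch."""
    hits = sum(
        1 for task in tasks
        if files_in_patch(task["patch"]) & set(task["edited_files"])
    )
    return hits / len(tasks)  # reported above as 70.3% across the benchmark tasks
```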

## Reliable code editing

Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.

Aider goes to great lengths to ensure that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that aider can
properly integrate code from LLMs into an existing code base and source files.
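
For example, aider's diff-style edit format has the LLM quote the code to change alongside its replacement; applying such an edit reduces to an exact search and replace, sketched here in simplified form (aider's real backends handle many more edge cases):

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, insisting on a unique, exact match."""
    count = source.count(search)
    if count == 0:
        raise ValueError("SEARCH text not found; the LLM must quote the code exactly")
    if count > 1:
        raise ValueError("SEARCH text is ambiguous; more context lines are needed")
    return source.replace(search, replace, 1)

# Usage: swap one statement while leaving the rest of the file untouched.
code = "def greet():\n    print('hi')\n"
patched = apply_edit(code, "print('hi')", "print('hello')")
assert patched == "def greet():\n    print('hello')\n"
```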

The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.

described in (3).
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.
To do that,
an evaluation script
verifies that the pre-existing and held-out tests
pass as expected from a correct solution.
If so, the issue is marked as resolved.
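
In outline, that verification might look like the following; a hedged sketch assuming pytest-style test suites and a checked-out candidate repo, not the actual SWE Bench evaluation code:

```python
import subprocess

def tests_pass(repo_dir: str, test_files: list[str]) -> bool:
    """Run the given tests inside the candidate repo; True if they all pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_files],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, preexisting: list[str], held_out: list[str]) -> bool:
    # A correct solution keeps the pre-existing tests passing *and*
    # makes the held-out acceptance tests pass.
    return tests_pass(repo_dir, preexisting) and tests_pass(repo_dir, held_out)
```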

## Computing the benchmark score

The benchmark harness produces one candidate solution for each of the 300
SWE Bench Lite instances and saves it as a `model_patch`.
A separate evaluation script
tests each of these results with the acceptance tests.
It verifies that they pass as expected from a correct solution, like
the "gold" patch developed by a human to solve the issue.
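
Concretely, SWE Bench style evaluations consume one JSON record per instance, with the candidate solution in a `model_patch` field; a minimal sketch of writing that file (the instance id and model label here are illustrative, not taken from aider's harness):

```python
import json

predictions = [
    {
        "instance_id": "django__django-11099",      # illustrative instance id
        "model_name_or_path": "aider-gpt-4o-opus",  # hypothetical label for this run
        "model_patch": "diff --git a/a.py b/a.py\n...",  # the candidate solution
    },
]

# One JSON record per line, as expected by SWE Bench style evaluation scripts.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```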

These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of