Paul Gauthier 2024-05-23 06:46:26 -07:00
parent 2852acb947
commit 071e5e273b


@@ -89,7 +89,7 @@ This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
@@ -99,35 +99,39 @@ The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- On the first attempt, aider with GPT-4o found 69% of all plausible solutions, which accounted for 77% of the correctly resolved problems.
- ~75% of all plausible solutions and ~90% of all resolved solutions were found after one attempt each from aider with GPT-4o and Opus.
- A long tail of solutions continued to be found by both models, including one resolved solution on the final, sixth attempt at that problem.
| Attempt | Agent | Number<br/>plausible<br/>solutions | Percent of<br/>plausible<br/>solutions | Number<br/>correctly<br/>resolved | Percent<br/>of correctly<br/>resolved |
|:-------:|-------|---------:|---------:|---------:|---------:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% |
| **Total** | | **300** | **100%** | **79** | **100%** |
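To make the retry scheme concrete, here is a minimal sketch of the fixed-order alternation, assuming hypothetical `run_aider` and `is_plausible` helpers supplied by the caller; this is an illustration, not the harness's actual code:

```python
def solve_instance(instance, run_aider, is_plausible, max_attempts=6):
    """Alternate models in a fixed order until a plausible solution appears.

    run_aider(instance, model) -> patch and is_plausible(patch) -> bool are
    hypothetical stand-ins, not aider's or the harness's actual API.
    """
    models = ["gpt-4o", "claude-3-opus"]    # fixed order, always starting with GPT-4o
    for attempt in range(max_attempts):
        model = models[attempt % 2]          # GPT-4o on odd attempts, Opus on even
        patch = run_aider(instance, model)   # one full aider run on the issue
        if is_plausible(patch):              # e.g. edits applied cleanly, tests passed
            return attempt + 1, model, patch
    return None                              # no plausible solution after six tries
```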
If we break down correct solutions purely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn at solving and therefore got to solve all the "easiest" problems.
Aider with Opus only ever saw the problems that GPT-4o failed to solve on the first attempt.
Aider with GPT-4o produced higher quality plausible solutions,
with a greater chance of going on to be accepted as resolving the issue.
Other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this endeavor.
| Agent | Number<br/>plausible<br/>solutions | Number<br/>correctly<br/>resolved | Percent<br/>plausible<br/>which<br/>resolved |
|-------|---------:|---------:|---------:|
| Aider with GPT-4o | 239 | 66 | 27.6% |
| Aider with Opus | 61 | 13 | 21.3% |
| **Total** | **300** | **79** | **26.3%** |
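The last column is each agent's resolved count divided by its plausible count (the total row divides by all 300 candidates); a quick check of the arithmetic:

```python
# Verify the "Percent plausible which resolved" column from the table above.
rows = {"Aider with GPT-4o": (239, 66), "Aider with Opus": (61, 13)}
for agent, (plausible, resolved) in rows.items():
    print(f"{agent}: {resolved / plausible:.1%}")  # 27.6%, 21.3%
print(f"Total: {79 / 300:.1%}")                    # 26.3% of all 300 candidates
```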
## Repository map, not RAG
@@ -171,14 +175,18 @@ Please add app.py to the chat so I can proceed with the changes.

This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench tasks.
Aider successfully identified the correct file to edit
in 70.3% of the benchmark tasks.

We can determine which file needed to be edited using the "gold" patch
that is associated with each SWE Bench task.
This patch was created by a human developer
to solve the issue, and therefore reveals a file which can
be edited to solve the problem.
Of course aider is not able to see or use the gold patch
or the file names it contains in any way.
This information was only used to compute
statistics after the benchmarking was completed.
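As an illustration only, such a statistic could be computed after the fact along these lines; `files_in_patch`, `gold_patch` and `files_added_to_chat` are hypothetical names for data the benchmark logs would need to contain:

```python
import re

def files_in_patch(patch_text):
    """Extract the file paths a unified diff modifies, from its '+++ b/...' lines."""
    return {m.group(1) for m in re.finditer(r"^\+\+\+ b/(\S+)", patch_text, re.M)}

def gold_file_hit_rate(results):
    """Fraction of tasks where aider added at least one gold-patch file to the chat.

    Each record in `results` is assumed (hypothetically) to carry the task's
    gold patch and the list of files aider added to the chat.
    """
    hits = sum(
        bool(files_in_patch(r["gold_patch"]) & set(r["files_added_to_chat"]))
        for r in results
    )
    return hits / len(results)  # reported as 70.3% for this benchmark run
```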
## Reliable code editing
@@ -186,13 +194,13 @@ They were only used to compute this statistic after the benchmarking was completed.

Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.

Aider goes to great lengths to ensure that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that aider can
properly integrate code from LLMs into an existing code base and source files.

The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
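To give a flavor of what reliable editing involves, here is a minimal sketch of one safeguard an editing backend might apply: an exact search-and-replace that refuses missing or ambiguous matches. It illustrates the idea and is not aider's actual implementation:

```python
from pathlib import Path

def apply_edit(path, search, replace):
    """Apply one exact search-and-replace edit, refusing ambiguous matches.

    A simplified illustration of reliable editing, not aider's backend: the
    edit is rejected unless the search text matches exactly once, so a
    hallucinated or stale snippet from the LLM fails loudly instead of
    silently corrupting the file.
    """
    source = Path(path).read_text()
    count = source.count(search)
    if count != 1:
        raise ValueError(f"search text matched {count} times; need exactly 1")
    Path(path).write_text(source.replace(search, replace, 1))
```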
@@ -293,7 +301,7 @@ described in (3).

Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.
To do that,
an evaluation script
verifies that the pre-existing and held out tests
pass as expected from a correct solution.
If so, the issue is marked as resolved.
@@ -342,10 +350,12 @@ and prioritizing solutions in the following order:

## Computing the benchmark score

The benchmark harness produces one candidate solution for each of the 300
SWE Bench Lite instances and saves it as a `model_patch`.
A separate evaluation script
tests each of these results with the acceptance tests.
It verifies that they pass as expected from a correct solution, like
the "gold" patch developed by a human to solve the issue.
These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of