Mirror of https://github.com/Aider-AI/aider.git (synced 2025-06-03 19:24:59 +00:00)
copy
Parent: 2febc663f3
Commit: 47a3cb8adf
4 changed files with 135 additions and 105 deletions
@@ -7,7 +7,7 @@ draft: true
 # Aider is SOTA for both SWE Bench and SWE Bench Lite
 
-Aider scored 18.8%
+Aider scored 18.9%
 on the main
 [SWE Bench benchmark](https://www.swebench.com),
 achieving a state-of-the-art result.
 
@@ -135,14 +135,14 @@ aider reported no outstanding errors from editing, linting and testing.
 - Or, the "most plausible" solution generated by either attempt, with the
 [fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
 
-The table also provides details on the 107 solutions that were ultimately
+The table also provides details on the 108 solutions that were ultimately
 verified as correctly resolving their issue.
 
 | Attempt | Agent |Number of<br>proposed<br>solutions|Percent of<br>proposed<br>solutions| Number of<br/>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite |
 |:--------:|------------|---------:|---------:|----:|---:|--:|
-| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
-| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
-| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
+| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
+| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
+| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
 
 ## Non-plausible but correct solutions?
 
@@ -205,19 +205,19 @@ showing whether aider with GPT-4o and with Opus
 produced plausible and/or correct solutions.
 
 |Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|--:|--:|--:|--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | not resolved | n/a | n/a | 181 | 0 |
-| C | non-plausible | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| D | non-plausible | **resolved** | **plausible** | not resolved | 2 | 0 |
-| E | non-plausible | not resolved | **plausible** | **resolved** | 12 | 12 |
-| F | non-plausible | not resolved | **plausible** | not resolved | 53 | 0 |
-| G | non-plausible | **resolved** | non-plausible | **resolved** | 16 | 16 |
-| H | non-plausible | **resolved** | non-plausible | not resolved | 5 | 3 |
-| I | non-plausible | not resolved | non-plausible | **resolved** | 4 | 2 |
-| J | non-plausible | not resolved | non-plausible | not resolved | 216 | 0 |
-| K | non-plausible | not resolved | n/a | n/a | 7 | 0 |
-|Total|||||570|107|
+|:--:|:--:|:--:|:--:|:--:|--:|--:|
+| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
+| B | **plausible** | no | n/a | n/a | 181 | 0 |
+| C | no | no | **plausible** | no | 53 | 0 |
+| D | no | no | **plausible** | **resolved** | 12 | 12 |
+| E | no | **resolved** | **plausible** | no | 2 | 0 |
+| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
+| G | no | no | no | no | 216 | 0 |
+| H | no | no | no | **resolved** | 4 | 2 |
+| I | no | **resolved** | no | no | 4 | 3 |
+| J | no | **resolved** | no | **resolved** | 17 | 17 |
+| K | no | no | n/a | n/a | 7 | 0 |
+|Total|||||570|108|
 
 Rows A-B show the cases where
 aider with GPT-4o found a plausible solution during the first attempt.
@@ -233,7 +233,7 @@ So Opus' solutions were adopted and they
 went on to be deemed correct for 13 problems
 and incorrect for 55.
 
-Row D is an interesting special case, where GPT-4o found 2
+In that group, Row E is an interesting special case, where GPT-4o found 2
 non-plausible but correct solutions.
 We can see that Opus overrides
 them with plausible-but-incorrect
@@ -271,8 +271,8 @@ and the benchmark harness, and only to compute statistics about the
 correctly resolved instances.
 They were never run, used, or even visible during aider's attempts to resolve the problems.
 
-Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
-or 18.8%.
+Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
+or 18.9%.
 
 ## Acknowledgments
 
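Reviewer note: the updated figures in this commit can be cross-checked with a little arithmetic. The sketch below is not part of the commit or of aider; it copies the counts from the added (new) side of the two tables and recomputes the totals and percentages. The names `attempts`, `outcome_rows`, and `pct` are illustrative assumptions, and rounding to one decimal place is assumed to match how the post reports its percentages.

```python
# Counts copied from the updated tables in this commit.
# All names here are hypothetical and exist only for this check.
TOTAL_INSTANCES = 570

attempts = {
    "Aider with GPT-4o": {"proposed": 419, "resolved": 87},
    "Aider with Opus": {"proposed": 151, "resolved": 21},
}

# (problems with this outcome, problems resolved) for rows A-K
# of the updated plausible/resolved outcome table.
outcome_rows = {
    "A": (73, 73), "B": (181, 0), "C": (53, 0), "D": (12, 12),
    "E": (2, 0), "F": (1, 1), "G": (216, 0), "H": (4, 2),
    "I": (4, 3), "J": (17, 17), "K": (7, 0),
}


def pct(part, whole):
    """Percentage rounded to one decimal place (assumed reporting convention)."""
    return round(100 * part / whole, 1)


total_proposed = sum(a["proposed"] for a in attempts.values())
total_resolved = sum(a["resolved"] for a in attempts.values())

# Sums over the outcome table, to compare against its "Total" row.
row_problem_sum = sum(problems for problems, _ in outcome_rows.values())
row_resolved_sum = sum(resolved for _, resolved in outcome_rows.values())

for name, a in attempts.items():
    print(
        name,
        pct(a["proposed"], total_proposed),    # percent of proposed solutions
        pct(a["resolved"], total_resolved),    # percent of correctly resolved solutions
        pct(a["resolved"], TOTAL_INSTANCES),   # contribution to the overall score
    )
print("overall", total_resolved, pct(total_resolved, TOTAL_INSTANCES))
```

Under these assumptions the two tables agree with each other and with the headline numbers: both sum to 570 problems and 108 resolved, and 108 out of 570 rounds to 18.9%.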