Paul Gauthier 2024-06-01 15:05:29 -07:00
parent 2febc663f3
commit 47a3cb8adf
4 changed files with 135 additions and 105 deletions

@@ -7,7 +7,7 @@ draft: true
# Aider is SOTA for both SWE Bench and SWE Bench Lite
-Aider scored 18.8%
+Aider scored 18.9%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
@@ -135,14 +135,14 @@ aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
-The table also provides details on the 107 solutions that were ultimately
+The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.
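The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration of the logic only; the attempt records and field names are hypothetical, not aider's actual benchmark-harness code.

```python
# Sketch of the solution-selection rule: take the first plausible attempt
# (no outstanding edit/lint/test errors), else fall back to the "most
# plausible" one, i.e. the attempt with the fewest outstanding errors.
# The dicts and keys below are illustrative, not aider's real data model.

def pick_solution(attempts):
    for attempt in attempts:
        if attempt["errors"] == 0:  # plausible: nothing outstanding
            return attempt
    return min(attempts, key=lambda a: a["errors"])

attempts = [
    {"agent": "Aider with GPT-4o", "errors": 2},
    {"agent": "Aider with Opus", "errors": 1},
]
# Neither attempt is plausible, so the one with fewer errors is adopted:
print(pick_solution(attempts)["agent"])  # Aider with Opus
```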
| Attempt | Agent |Number&nbsp;of<br>proposed<br>solutions|Percent&nbsp;of<br>proposed<br>solutions| Number&nbsp;of<br/>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench<br>Lite |
|:--------:|------------|---------:|---------:|----:|---:|--:|
-| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
-| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
-| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
+| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
+| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
+| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
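The percentages in the table follow directly from the counts. A quick back-of-the-envelope check in Python (a standalone sketch, not part of the benchmark harness):

```python
# Verify the arithmetic in the attempts table above.
proposed = {"GPT-4o": 419, "Opus": 151}
resolved = {"GPT-4o": 87, "Opus": 21}
total_benchmarked = 570

assert sum(proposed.values()) == total_benchmarked
assert sum(resolved.values()) == 108

# Percent of correctly resolved solutions, per agent
print(round(100 * resolved["GPT-4o"] / 108, 1))  # 80.6
print(round(100 * resolved["Opus"] / 108, 1))    # 19.4

# Score on SWE Bench Lite: resolved instances over all 570 benchmarked
print(round(100 * 108 / total_benchmarked, 1))   # 18.9
```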
## Non-plausible but correct solutions?
@@ -205,19 +205,19 @@ showing whether aider with GPT-4o and with Opus
produced plausible and/or correct solutions.
|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|--:|--:|--:|--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | not resolved | n/a | n/a | 181 | 0 |
-| C | non-plausible | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| D | non-plausible | **resolved** | **plausible** | not resolved | 2 | 0 |
-| E | non-plausible | not resolved | **plausible** | **resolved** | 12 | 12 |
-| F | non-plausible | not resolved | **plausible** | not resolved | 53 | 0 |
-| G | non-plausible | **resolved** | non-plausible | **resolved** | 16 | 16 |
-| H | non-plausible | **resolved** | non-plausible | not resolved | 5 | 3 |
-| I | non-plausible | not resolved | non-plausible | **resolved** | 4 | 2 |
-| J | non-plausible | not resolved | non-plausible | not resolved | 216 | 0 |
-| K | non-plausible | not resolved | n/a | n/a | 7 | 0 |
-|Total|||||570|107|
+|:--:|:--:|:--:|:--:|:--:|--:|--:|
+| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
+| B | **plausible** | no | n/a | n/a | 181 | 0 |
+| C | no | no | **plausible** | no | 53 | 0 |
+| D | no | no | **plausible** | **resolved** | 12 | 12 |
+| E | no | **resolved** | **plausible** | no | 2 | 0 |
+| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
+| G | no | no | no | no | 216 | 0 |
+| H | no | no | no | **resolved** | 4 | 2 |
+| I | no | **resolved** | no | no | 4 | 3 |
+| J | no | **resolved** | no | **resolved** | 17 | 17 |
+| K | no | no | n/a | n/a | 7 | 0 |
+|Total|||||570|108|
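The per-row counts in the revised table can be cross-checked against the stated totals. A standalone consistency sketch (the row data below is copied from the table, not computed by aider):

```python
# Tally the outcome table above: (problems, resolved) for each row A-K.
rows = {
    "A": (73, 73), "B": (181, 0), "C": (53, 0), "D": (12, 12),
    "E": (2, 0),   "F": (1, 1),   "G": (216, 0), "H": (4, 2),
    "I": (4, 3),   "J": (17, 17), "K": (7, 0),
}
total_problems = sum(n for n, _ in rows.values())
total_resolved = sum(r for _, r in rows.values())
print(total_problems, total_resolved)  # 570 108
```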
Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.
@@ -233,7 +233,7 @@ So Opus' solutions were adopted and they
went on to be deemed correct for 13 problems
and incorrect for 55.
-Row D is an interesting special case, where GPT-4o found 2
+In that group, Row E is an interesting special case, where GPT-4o found 2
non-plausible but correct solutions.
We can see that Opus overrides
them with plausible-but-incorrect
@@ -271,8 +271,8 @@ and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to resolve the problems.
-Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
-or 18.8%.
+Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
+or 18.9%.
## Acknowledgments