Paul Gauthier 2024-06-03 10:55:07 -07:00
parent 4753db0b0f
commit 4770e0ffc0
2 changed files with 18 additions and 0 deletions


@@ -13,6 +13,15 @@ achieving a state-of-the-art result.
The previous top leaderboard entry was 20.3%
from Amazon Q Developer Agent.
**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N pass, that counts as a pass@N success.
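
As a rough illustration of the difference between the two metrics, here is a minimal sketch. It is not aider's benchmark harness: the function names, the `attempts` outcomes, and the `chosen` indices are all hypothetical, made up only to show how the two scores are counted.

```python
from typing import Sequence


def pass_at_1(selected: Sequence[bool]) -> float:
    """Fraction of problems whose single returned solution passes the tests."""
    return sum(selected) / len(selected)


def pass_at_n(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems where at least one of the N attempts passes."""
    return sum(any(problem) for problem in attempts) / len(attempts)


# Hypothetical outcomes: 4 problems, 3 attempts each (True = passes acceptance tests).
attempts = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]

# Hypothetical index of the one candidate returned for each problem.
# The choice is made without seeing the acceptance test results.
chosen = [1, 0, 2, 0]
selected = [problem[i] for problem, i in zip(attempts, chosen)]

print(pass_at_1(selected))  # 0.25: only the returned candidate is scored
print(pass_at_n(attempts))  # 0.75: any passing attempt counts
```

With the same underlying attempts, the pass@N score can only be equal to or higher than the pass@1 score, which is why the two should not be compared directly.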
[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
Please see the [references](#references)


@@ -18,6 +18,15 @@ The best result reported elsewhere seems to be
This result on the main SWE Bench builds on
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N pass, that counts as a pass@N success.
[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
Aider was benchmarked on the same