mirror of https://github.com/Aider-AI/aider.git (synced 2025-06-02 18:54:59 +00:00)
pass@1
parent 4753db0b0f
commit 4770e0ffc0
2 changed files with 18 additions and 0 deletions
@@ -13,6 +13,15 @@ achieving a state-of-the-art result.
The previous top leaderboard entry was 20.3%
from Amazon Q Developer Agent.

**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N pass, that counts as a pass@N success.

[](https://aider.chat/assets/swe_bench_lite.svg)

Please see the [references](#references)
@@ -18,6 +18,15 @@ The best result reported elsewhere seems to be
This result on the main SWE Bench builds on
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).

**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N pass, that counts as a pass@N success.

[](https://aider.chat/assets/swe_bench.svg)

Aider was benchmarked on the same
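The pass@1 versus pass@N distinction described in both hunks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `attempts` table, and the `chosen` indices are all hypothetical, and none of it is taken from aider's benchmark harness or from SWE Bench itself.

```python
from typing import Sequence


def pass_at_1(selected_passes: Sequence[bool]) -> float:
    """Score when only the single returned candidate per problem is tested."""
    return sum(selected_passes) / len(selected_passes)


def pass_at_n(all_attempts: Sequence[Sequence[bool]]) -> float:
    """Score when every attempt is tested and any passing attempt counts."""
    return sum(any(problem) for problem in all_attempts) / len(all_attempts)


# Hypothetical data: 4 problems, 3 internal attempts each.
# True means the attempt would pass the acceptance tests.
attempts = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]

# Hypothetical index of the one candidate the agent returns per problem.
chosen = [1, 0, 2, 0]

# pass@1: only the chosen candidate is evaluated.
selected = [attempts[i][c] for i, c in enumerate(chosen)]

score_pass_at_1 = pass_at_1(selected)
score_pass_at_3 = pass_at_n(attempts)
```

With this made-up data the agent returns a passing candidate for only one problem, while three problems have at least one passing attempt, so the pass@3 figure comes out higher than the pass@1 figure even though the underlying attempts are identical.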