pass@1

2025-06-06 04:35:00 +00:00 · 2024-06-03 10:55:07 -07:00 · 2024-06-03 10:55:07 -07:00 · 4770e0ffc0
commit 4770e0ffc0
parent 4753db0b0f
2 changed files with 18 additions and 0 deletions
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -13,6 +13,15 @@ achieving a state-of-the-art result.
 The previous top leaderboard entry was 20.3%
 from Amazon Q Developer Agent.
 **To be clear, all of aider's results reported here are pass@1 results.**
 The "aider agent" internally makes multiple "attempts" at solving the problem,
 but it picks and returns one single candidate solution.
 Only that one candidate solution is evaluated with the acceptance tests
 and contributes to the benchmark score.
 This is in contrast to a pass@N result for N>1, where N attempts are made
 and all N solutions are evaluated by the acceptance tests.
 If *any* of the N pass, that counts as a pass@N success.
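The difference between the two scoring rules can be sketched in a few lines of Python. This is only an illustration, not aider's benchmark harness: the function names, the patch labels, and the `passes_tests` callback are all hypothetical stand-ins for running the acceptance tests on a candidate solution.

```python
from typing import Callable, Sequence


def pass_at_1(chosen: str, passes_tests: Callable[[str], bool]) -> bool:
    """Score the single candidate the agent returned (hypothetical helper)."""
    return passes_tests(chosen)


def pass_at_n(candidates: Sequence[str], passes_tests: Callable[[str], bool]) -> bool:
    """Score N candidates: success if any one of them passes (hypothetical helper)."""
    return any(passes_tests(c) for c in candidates)


# Hypothetical problem: three internal attempts, only "patch_b" passes the tests.
attempts = ["patch_a", "patch_b", "patch_c"]


def passes_tests(patch: str) -> bool:
    # Stand-in for running the acceptance tests on one candidate.
    return patch == "patch_b"


# Under pass@1 the agent must commit to one candidate before any evaluation.
picked_wrong = pass_at_1("patch_a", passes_tests)
picked_right = pass_at_1("patch_b", passes_tests)

# Under pass@N all attempts are evaluated, so one passing attempt is enough.
any_passed = pass_at_n(attempts, passes_tests)
```

With the same three attempts, the pass@1 score depends entirely on which candidate the agent selects, while the pass@N score counts the problem as solved as long as any attempt passes.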
 [![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
 Please see the [references](#references)
--- a/_posts/2024-06-02-main-swe-bench.md
+++ b/_posts/2024-06-02-main-swe-bench.md
@@ -18,6 +18,15 @@ The best result reported elsewhere seems to be
 This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 **To be clear, all of aider's results reported here are pass@1 results.**
 The "aider agent" internally makes multiple "attempts" at solving the problem,
 but it picks and returns one single candidate solution.
 Only that one candidate solution is evaluated with the acceptance tests
 and contributes to the benchmark score.
 This is in contrast to a pass@N result for N>1, where N attempts are made
 and all N solutions are evaluated by the acceptance tests.
 If *any* of the N pass, that counts as a pass@N success.
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
 Aider was benchmarked on the same