pass@1

2025-06-02 18:54:59 +00:00 · 2024-06-03 10:55:07 -07:00 · 2024-06-03 10:55:07 -07:00 · 4770e0ffc0
commit 4770e0ffc0
parent 4753db0b0f
2 changed files with 18 additions and 0 deletions
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -13,6 +13,15 @@ achieving a state-of-the-art result.
 The previous top leaderboard entry was 20.3%
 from Amazon Q Developer Agent.

+**To be clear, all of aider's results reported here are pass@1 results.**
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N pass, that counts as a pass@N success.
+
 [![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)

 Please see the [references](#references)
--- a/_posts/2024-06-02-main-swe-bench.md
+++ b/_posts/2024-06-02-main-swe-bench.md
@@ -18,6 +18,15 @@ The best result reported elsewhere seems to be
 This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).

+**To be clear, all of aider's results reported here are pass@1 results.**
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N pass, that counts as a pass@N success.
+
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)

 Aider was benchmarked on the same
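
The pass@1 versus pass@N distinction described in the added paragraphs can be illustrated with a minimal Python sketch. This is not aider's benchmark harness: the function names `pass_at_1` and `pass_at_n`, the `select` callback, and the toy patch names are all hypothetical, chosen only to show which candidates reach the acceptance tests under each scoring rule.

```python
from typing import Callable, Sequence

# Hypothetical types for illustration only; not part of aider or SWE Bench.
Candidate = str
TestFn = Callable[[Candidate], bool]


def pass_at_1(
    candidates: Sequence[Candidate],
    select: Callable[[Sequence[Candidate]], Candidate],
    passes_tests: TestFn,
) -> bool:
    """pass@1: the agent picks one candidate; only that one is evaluated."""
    chosen = select(candidates)
    return passes_tests(chosen)


def pass_at_n(candidates: Sequence[Candidate], passes_tests: TestFn) -> bool:
    """pass@N: every candidate is evaluated; any single pass is a success."""
    return any(passes_tests(c) for c in candidates)


# Toy problem: three internal attempts, only the last satisfies the tests.
attempts = ["patch_a", "patch_b", "patch_c"]
acceptance = lambda patch: patch == "patch_c"

# The agent commits to the first attempt without seeing the acceptance tests.
score_1 = pass_at_1(attempts, select=lambda cs: cs[0], passes_tests=acceptance)
score_n = pass_at_n(attempts, passes_tests=acceptance)
```

In this toy case `score_1` is `False` and `score_n` is `True`: the same set of attempts fails under pass@1 because the selected candidate is the only one scored, while pass@N credits the problem as solved because one of the three attempts passes.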