diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index 120119ad5..b97309725 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -13,6 +13,15 @@ achieving a state-of-the-art result.
 The previous top leaderboard entry was 20.3%
 from Amazon Q Developer Agent.
 
+**To be clear, all of aider's results reported here are pass@1 results.**
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N pass, that counts as a pass@N success.
+
 [![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
 
 Please see the [references](#references)
diff --git a/_posts/2024-06-02-main-swe-bench.md b/_posts/2024-06-02-main-swe-bench.md
index b6ab104ac..a3179e9f7 100644
--- a/_posts/2024-06-02-main-swe-bench.md
+++ b/_posts/2024-06-02-main-swe-bench.md
@@ -18,6 +18,15 @@ The best result reported elsewhere seems to be
 This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 
+**To be clear, all of aider's results reported here are pass@1 results.**
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N pass, that counts as a pass@N success.
+
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
 
 Aider was benchmarked on the same