diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md index d346dc9d1..beec36388 100644 --- a/_posts/2024-05-22-swe-bench-lite.md +++ b/_posts/2024-05-22-swe-bench-lite.md @@ -414,6 +414,7 @@ The "aider agent" internally makes multiple "attempts" at solving the problem, but it picks and returns one single candidate solution. Only that one candidate solution is evaluated with the acceptance tests and contributes to the benchmark score. + This is contrast to a pass@N result for N>1, where N attempts are made and all N solutions are evaluated by the acceptance tests. If *any* of the N solution pass, that counts as a pass@N success. diff --git a/_posts/2024-06-02-main-swe-bench.md b/_posts/2024-06-02-main-swe-bench.md index 4a9970fca..a4d963567 100644 --- a/_posts/2024-06-02-main-swe-bench.md +++ b/_posts/2024-06-02-main-swe-bench.md @@ -238,6 +238,7 @@ The "aider agent" internally makes multiple "attempts" at solving the problem, but it picks and returns one single candidate solution. Only that one candidate solution is evaluated with the acceptance tests and contributes to the benchmark score. + This is contrast to a pass@N result for N>1, where N attempts are made and all N solutions are evaluated by the acceptance tests. If *any* of the N solution pass, that counts as a pass@N success.