This commit is contained in:
Paul Gauthier 2024-06-03 11:12:25 -07:00
parent 4770e0ffc0
commit 8a8f3936f4
2 changed files with 29 additions and 20 deletions


@ -13,17 +13,11 @@ achieving a state-of-the-art result.
The previous top leaderboard entry was 20.3%
from Amazon Q Developer Agent.
**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
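As a rough sketch, the difference between the two metrics can be written as follows. The function and parameter names here are illustrative only, not aider's actual code:

```python
# Contrast pass@1 with pass@N scoring for a single benchmark problem.
# All names below are hypothetical; aider's internal implementation differs.

def pass_at_1(candidates, passes_acceptance_tests, pick_best):
    # The agent generates several candidate solutions but submits exactly
    # one; only that single submission is run against the acceptance tests.
    chosen = pick_best(candidates)
    return passes_acceptance_tests(chosen)

def pass_at_n(candidates, passes_acceptance_tests):
    # Every candidate is run against the acceptance tests; the problem
    # counts as solved if *any* one of them passes.
    return any(passes_acceptance_tests(c) for c in candidates)
```

For example, with two candidate patches of which only the second passes the tests, `pass_at_n` scores a success, while `pass_at_1` succeeds only if the agent's selection step happens to pick the passing patch.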
[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
**To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.**
All results in the above chart are unhinted pass@1 results.
Please see the [references](#references)
for details on the data presented in this chart.
It was corrected on 5/30/24 to reflect apples-to-apples comparisons,
@ -413,7 +407,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
## References
Below are the references for the SWE-Bench Lite results
To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
Below are the references for the pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.
- [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)


@ -18,17 +18,10 @@ The best result reported elsewhere seems to be
This result on the main SWE Bench builds on
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
**To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.**
Aider was benchmarked on the same
[570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
that were used in the
@ -238,7 +231,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
## References
Below are the references for the SWE-Bench results
To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
Below are the references for the pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.
- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)