mirror of https://github.com/Aider-AI/aider.git
commit 8a8f3936f4 (parent 4770e0ffc0)
2 changed files with 29 additions and 20 deletions

File 1 of 2 (the SWE Bench Lite post):

@@ -13,17 +13,11 @@ achieving a state-of-the-art result.
 The previous top leaderboard entry was 20.3%
 from Amazon Q Developer Agent.
 
-**To be clear, all of aider's results reported here are pass@1 results.**
-The "aider agent" internally makes multiple "attempts" at solving the problem,
-but it picks and returns one single candidate solution.
-Only that one candidate solution is evaluated with the acceptance tests
-and contributes to the benchmark score.
-This is contrast to a pass@N result for N>1, where N attempts are made
-and all N solutions are evaluated by the acceptance tests.
-If *any* of the N pass, that counts as a pass@N success.
-
 [](https://aider.chat/assets/swe_bench_lite.svg)
 
+**To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.**
+All results in the above chart are unhinted pass@1 results.
 Please see the [references](#references)
 for details on the data presented in this chart.
 It was corrected on 5/30/24 to reflect apples-to-apples comparisons,

@@ -413,7 +407,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 ## References
 
-Below are the references for the SWE-Bench Lite results
+To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.
+
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N solutions pass, that counts as a pass@N success.
+
+Below are the references for the pass@1 unhinted SWE-Bench results
 displayed in the graph at the beginning of this article.
 
 - [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)

File 2 of 2 (the main SWE Bench post):

@@ -18,17 +18,10 @@ The best result reported elsewhere seems to be
 This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 
-**To be clear, all of aider's results reported here are pass@1 results.**
-The "aider agent" internally makes multiple "attempts" at solving the problem,
-but it picks and returns one single candidate solution.
-Only that one candidate solution is evaluated with the acceptance tests
-and contributes to the benchmark score.
-This is contrast to a pass@N result for N>1, where N attempts are made
-and all N solutions are evaluated by the acceptance tests.
-If *any* of the N pass, that counts as a pass@N success.
-
 [](https://aider.chat/assets/swe_bench.svg)
 
+**To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.**
 Aider was benchmarked on the same
 [570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
 that were used in the

@@ -238,7 +231,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 ## References
 
-Below are the references for the SWE-Bench results
+To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.
+
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N solutions pass, that counts as a pass@N success.
+
+Below are the references for the pass@1 unhinted SWE-Bench results
 displayed in the graph at the beginning of this article.
 
 - [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
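
As context for the pass@1 vs pass@N distinction drawn in the added text above, here is a minimal sketch of the two scoring rules. It is only an illustration under simplified assumptions, not aider's actual benchmark harness; the function names, the `chosen_index` parameter, and the example outcomes are made up for this sketch.

```python
# Illustrative sketch only (not aider's benchmark harness):
# how pass@1 scoring differs from pass@N scoring for a single benchmark problem.

def pass_at_1(attempt_passed: list[bool], chosen_index: int) -> bool:
    """pass@1: only the single candidate the agent chose to return is
    run against the acceptance tests and counted."""
    return attempt_passed[chosen_index]

def pass_at_n(attempt_passed: list[bool]) -> bool:
    """pass@N: all N candidates are run against the acceptance tests;
    the problem counts as solved if *any* of them passes."""
    return any(attempt_passed)

# Hypothetical example: three attempts were made and the first was returned.
attempt_passed = [False, True, False]   # acceptance-test outcome of each attempt
print(pass_at_1(attempt_passed, chosen_index=0))  # False: the returned candidate failed
print(pass_at_n(attempt_passed))                  # True: at least one attempt passed
```

Under pass@1 the score therefore hinges on which single candidate the agent chose to return, while pass@N gives credit for any passing candidate among the N attempts.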