From 8a8f3936f45c11fd2e96288841ac633b4524ae52 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Mon, 3 Jun 2024 11:12:25 -0700
Subject: [PATCH] copy

---
 _posts/2024-05-22-swe-bench-lite.md | 25 +++++++++++++++----------
 _posts/2024-06-02-main-swe-bench.md | 24 ++++++++++++++----------
 2 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index b97309725..b87d57819 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -13,17 +13,11 @@ achieving a state-of-the-art result.
 The previous top leaderboard entry was 20.3%
 from Amazon Q Developer Agent.
 
-**To be clear, all of aider's results reported here are pass@1 results.**
-The "aider agent" internally makes multiple "attempts" at solving the problem,
-but it picks and returns one single candidate solution.
-Only that one candidate solution is evaluated with the acceptance tests
-and contributes to the benchmark score.
-This is contrast to a pass@N result for N>1, where N attempts are made
-and all N solutions are evaluated by the acceptance tests.
-If *any* of the N pass, that counts as a pass@N success.
-
 [![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
 
+**To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.**
+All results in the above chart are unhinted pass@1 results.
 Please see the [references](#references)
 for details on the data presented in this chart.
 It was corrected on 5/30/24 to reflect apples-to-apples comparisons,
@@ -413,7 +407,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 ## References
 
-Below are the references for the SWE-Bench Lite results
+To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.
+
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N solutions pass, that counts as a pass@N success.
+
+Below are the references for the pass@1 unhinted SWE-Bench results
 displayed in the graph at the beginning of this article.
 
 - [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)
diff --git a/_posts/2024-06-02-main-swe-bench.md b/_posts/2024-06-02-main-swe-bench.md
index a3179e9f7..0d72da5ac 100644
--- a/_posts/2024-06-02-main-swe-bench.md
+++ b/_posts/2024-06-02-main-swe-bench.md
@@ -18,17 +18,10 @@ The best result reported elsewhere seems to be
 This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 
-**To be clear, all of aider's results reported here are pass@1 results.**
-The "aider agent" internally makes multiple "attempts" at solving the problem,
-but it picks and returns one single candidate solution.
-Only that one candidate solution is evaluated with the acceptance tests
-and contributes to the benchmark score.
-This is contrast to a pass@N result for N>1, where N attempts are made
-and all N solutions are evaluated by the acceptance tests.
-If *any* of the N pass, that counts as a pass@N success.
-
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
 
+**To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.**
 Aider was benchmarked on the same
 [570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
 that were used in the
@@ -238,7 +231,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 ## References
 
-Below are the references for the SWE-Bench results
+To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.
+
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N solutions pass, that counts as a pass@N success.
+
+Below are the references for the pass@1 unhinted SWE-Bench results
 displayed in the graph at the beginning of this article.
 
 - [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
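
The pass@1 vs. pass@N distinction that both posts spell out reduces to a few lines of code. Below is a minimal sketch of that scoring rule, kept outside the patch itself; the function names and the `run_acceptance_tests` callback are hypothetical stand-ins, not aider's or SWE Bench's actual evaluation harness.

```python
# Minimal sketch of the pass@1 vs. pass@N scoring described in the posts.
# `candidates` is the list of solutions an agent produced for one problem;
# `run_acceptance_tests` is a hypothetical callback returning True on a pass.

def pass_at_1(candidates, run_acceptance_tests):
    # The agent may make many internal attempts, but it submits exactly one
    # candidate; only that single submission is evaluated.
    chosen = candidates[0]  # the one solution the agent chooses to return
    return run_acceptance_tests(chosen)

def pass_at_n(candidates, run_acceptance_tests):
    # All N candidates are evaluated; the problem counts as solved
    # if *any* of them passes the acceptance tests.
    return any(run_acceptance_tests(c) for c in candidates)
```

Under this framing, aider's reported numbers correspond to `pass_at_1`: the internal retries only influence which single candidate gets submitted, never how many candidates get scored.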