This commit is contained in:
Paul Gauthier 2024-06-03 11:12:25 -07:00
parent 4770e0ffc0
commit 8a8f3936f4
2 changed files with 29 additions and 20 deletions


@ -13,17 +13,11 @@ achieving a state-of-the-art result.
The previous top leaderboard entry was 20.3%
from Amazon Q Developer Agent.
**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
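As a rough sketch, the difference between the two metrics can be written as follows. The function and parameter names here are illustrative only, not aider's actual code:

```python
# Contrast pass@1 with pass@N scoring for a single benchmark problem.
# All names below are hypothetical; aider's internal implementation differs.

def pass_at_1(candidates, passes_acceptance_tests, pick_best):
    # The agent generates several candidate solutions but submits exactly
    # one; only that single submission is run against the acceptance tests.
    chosen = pick_best(candidates)
    return passes_acceptance_tests(chosen)

def pass_at_n(candidates, passes_acceptance_tests):
    # Every candidate is run against the acceptance tests; the problem
    # counts as solved if *any* one of them passes.
    return any(passes_acceptance_tests(c) for c in candidates)
```

For example, with two candidate patches of which only the second passes the tests, `pass_at_n` scores a success, while `pass_at_1` succeeds only if the agent's selection step happens to pick the passing patch.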
[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
**To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.**
All results in the above chart are unhinted pass@1 results.
Please see the [references](#references)
for details on the data presented in this chart.
It was corrected on 5/30/24 to reflect apples-to-apples comparisons,
@ -413,7 +407,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
## References
Below are the references for the SWE-Bench Lite results
To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
Below are the references for the pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.
- [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)


@ -18,17 +18,10 @@ The best result reported elsewhere seems to be
This result on the main SWE Bench builds on
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
**To be clear, all of aider's results reported here are pass@1 results.**
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
**To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.**
Aider was benchmarked on the same
[570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
that were used in the
@ -238,7 +231,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
## References
Below are the references for the SWE-Bench results
To be clear, all of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
Below are the references for the pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.
- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)