mirror of https://github.com/Aider-AI/aider.git
commit 8a8f3936f4 (parent 4770e0ffc0)
2 changed files with 29 additions and 20 deletions

File 1 of 2 (the SWE Bench Lite post):

@@ -13,17 +13,11 @@ achieving a state-of-the-art result.
 The previous top leaderboard entry was 20.3%
 from Amazon Q Developer Agent.
 
-**To be clear, all of aider's results reported here are pass@1 results.**
-The "aider agent" internally makes multiple "attempts" at solving the problem,
-but it picks and returns one single candidate solution.
-Only that one candidate solution is evaluated with the acceptance tests
-and contributes to the benchmark score.
-This is contrast to a pass@N result for N>1, where N attempts are made
-and all N solutions are evaluated by the acceptance tests.
-If *any* of the N pass, that counts as a pass@N success.
-
 [](https://aider.chat/assets/swe_bench_lite.svg)
 
+**To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.**
+All results in the above chart are unhinted pass@1 results.
 Please see the [references](#references)
 for details on the data presented in this chart.
 It was corrected on 5/30/24 to reflect apples-to-apples comparisons,

@@ -413,7 +407,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 ## References
 
-Below are the references for the SWE-Bench Lite results
+To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.
+
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N solutions pass, that counts as a pass@N success.
+
+Below are the references for the pass@1 unhinted SWE-Bench results
 displayed in the graph at the beginning of this article.
 
 - [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)

File 2 of 2 (the main SWE Bench post):

@@ -18,17 +18,10 @@ The best result reported elsewhere seems to be
 This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 
-**To be clear, all of aider's results reported here are pass@1 results.**
-The "aider agent" internally makes multiple "attempts" at solving the problem,
-but it picks and returns one single candidate solution.
-Only that one candidate solution is evaluated with the acceptance tests
-and contributes to the benchmark score.
-This is contrast to a pass@N result for N>1, where N attempts are made
-and all N solutions are evaluated by the acceptance tests.
-If *any* of the N pass, that counts as a pass@N success.
-
 [](https://aider.chat/assets/swe_bench.svg)
 
+**To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.**
 Aider was benchmarked on the same
 [570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
 that were used in the

@@ -238,7 +231,18 @@ making it faster, easier, and more reliable to run the acceptance tests.
 
 ## References
 
-Below are the references for the SWE-Bench results
+To be clear, all of aider's results reported here are pass@1 results,
+obtained without using the SWE Bench `hints_text`.
+
+The "aider agent" internally makes multiple "attempts" at solving the problem,
+but it picks and returns one single candidate solution.
+Only that one candidate solution is evaluated with the acceptance tests
+and contributes to the benchmark score.
+This is in contrast to a pass@N result for N>1, where N attempts are made
+and all N solutions are evaluated by the acceptance tests.
+If *any* of the N solutions pass, that counts as a pass@N success.
+
+Below are the references for the pass@1 unhinted SWE-Bench results
 displayed in the graph at the beginning of this article.
 
 - [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
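
As context for the pass@1 vs pass@N distinction drawn in the added text above, here is a minimal sketch of the two scoring rules. It is only an illustration under simplified assumptions, not aider's actual benchmark harness; the function names, the `chosen_index` parameter, and the example outcomes are made up for this sketch.

```python
# Illustrative sketch only (not aider's benchmark harness):
# how pass@1 scoring differs from pass@N scoring for a single benchmark problem.

def pass_at_1(attempt_passed: list[bool], chosen_index: int) -> bool:
    """pass@1: only the single candidate the agent chose to return is
    run against the acceptance tests and counted."""
    return attempt_passed[chosen_index]

def pass_at_n(attempt_passed: list[bool]) -> bool:
    """pass@N: all N candidates are run against the acceptance tests;
    the problem counts as solved if *any* of them passes."""
    return any(attempt_passed)

# Hypothetical example: three attempts were made and the first was returned.
attempt_passed = [False, True, False]   # acceptance-test outcome of each attempt
print(pass_at_1(attempt_passed, chosen_index=0))  # False: the returned candidate failed
print(pass_at_n(attempt_passed))                  # True: at least one attempt passed
```

Under pass@1 the score therefore hinges on which single candidate the agent chose to return, while pass@N gives credit for any passing candidate among the N attempts.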