Mirror of https://github.com/Aider-AI/aider.git, synced 2025-06-02 10:45:00 +00:00

commit 6966936316 (parent 07731e30dc)
Updated graph to use pass@1 unhinted results from other agents

5 changed files with 435 additions and 213 deletions
@@ -12,11 +12,17 @@ on the
achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[25% from OpenDevin](https://x.com/gneubig/status/1791498953709752405).
![SWE Bench Lite results](https://aider.chat/assets/swe_bench_lite.svg)
Please see the [references](#references)
for details on the data presented in this chart.
It was updated 5/30/24 to reflect apples-to-apples comparisons,
using pass@1 results from AutoCodeRover
and results from OpenDevin that don't use hints.
The [official SWE Bench Lite leaderboard](https://www.swebench.com)
only accepts pass@1 results that do not use hints.
## Interactive, not agentic
Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.

@@ -397,14 +403,33 @@ making it faster, easier, and more reliable to run the acceptance tests.
## References
Below are the references for the SWE-Bench Lite results
displayed in the graph at the beginning of this article.
- [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)
- [19.0% AutoCodeRover](https://github.com/swe-bench/experiments/pull/11)
- [18.0% SWE-Agent + GPT-4](https://www.swebench.com)
- [16.7% OpenDevin](https://github.com/OpenDevin/OpenDevin/issues/2149)
- [11.7% SWE-Agent + Opus](https://www.swebench.com)
Note: the graph was updated on 5/30/24 as follows.
The graph now contains AutoCodeRover's pass@1 results.
Previously it was reporting the pass@3 results, which are
not comparable
to the pass@1 aider results being reported here.
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
features the pass@3 results
without clearly labeling them as such.
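
To make the pass@1 vs pass@3 distinction concrete, here is a small sketch using the standard unbiased pass@k estimator; this is the general metric definition, not code from aider or AutoCodeRover:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples passes, given n total attempts with c successes."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# One success in three attempts: pass@1 is ~33%, but pass@3 is 100%,
# which is why the two numbers cannot be compared directly.
print(pass_at_k(3, 1, 1))  # 0.333...
print(pass_at_k(3, 1, 3))  # 1.0
```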
The graph now contains the best OpenDevin results obtained without using
the `hints_text` to provide hints to the agent.
The previous graph contained their hinted result,
which is not comparable
to the unhinted aider results being reported here.
OpenDevin's [hinted result was reported](https://x.com/gneubig/status/1791498953709752405)
without noting that hints were used.
The [official SWE Bench Lite leaderboard](https://www.swebench.com)
only accepts pass@1 results that do not use `hints_text`.
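
As a rough illustration of the hinted vs unhinted distinction: the `problem_statement` and `hints_text` field names follow the SWE-bench dataset, but the prompt assembly below is a hypothetical sketch, not any leaderboard entrant's actual code:

```python
def build_prompt(instance: dict, use_hints: bool = False) -> str:
    """Build a model prompt from a SWE-bench instance.
    `hints_text` comes from the issue's discussion thread and can
    leak the fix, so leaderboard-eligible runs must leave it out."""
    parts = [instance["problem_statement"]]
    if use_hints and instance.get("hints_text"):
        parts.append("Hints:\n" + instance["hints_text"])
    return "\n\n".join(parts)

instance = {
    "problem_statement": "TypeError raised when calling foo() with no args.",
    "hints_text": "The bug is in foo()'s default-argument handling.",
}
print(build_prompt(instance))                  # unhinted: problem only
print(build_prompt(instance, use_hints=True))  # hinted: includes hints_text
```

Runs accepted by the leaderboard correspond to the `use_hints=False` path.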