---
title: Aider is SOTA for both SWE Bench and SWE Bench Lite
excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
highlight_image: /assets/swe_bench.jpg
nav_exclude: true
---

{% if page.date %}
{{ page.date | date: "%B %d, %Y" }}
{% endif %}

# Aider is SOTA for both SWE Bench and SWE Bench Lite

Aider scored 18.9% on the main [SWE Bench benchmark](https://www.swebench.com), achieving a state-of-the-art result. The current top leaderboard entry is 13.8% from Amazon Q Developer Agent. The best result reported elsewhere seems to be [13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).

This result on the main SWE Bench builds on [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).

![SWE Bench results](https://aider.chat/assets/swe_bench.svg)

**All of aider's results reported here are pass@1 results, obtained without using the SWE Bench `hints_text`.** Aider was benchmarked on the same [570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs) that were used in the [Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report). See the [references](#references) for more details on the data presented in this chart.

## Interactive, not agentic

Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for automatically fixing linting and testing errors. Aider intentionally has quite limited and narrow "agentic behavior" to avoid long delays, high token costs and the need for users to repeatedly code review incorrect solutions. It's also worth noting that aider currently does not use RAG, vector search, tools or give the LLM access to search the web or unilaterally execute code.

Aider is first and foremost an interactive tool for engineers to get real work done in real code bases using a chat interface. Aider provides a pair programming UX where users can ask for a change and see code edits performed in real-time. Aider can also offer additional help like fixing lint or test errors, but the user is always in full interactive control. This allows them to quickly steer misunderstandings back on course and avoid wasting time and token costs.

## Benchmark methodology

Benchmarking was conducted as follows:

- Aider with GPT-4o was launched in each problem's git repository with the problem statement submitted as the opening chat message from "the user".
- After that, aider ran as normal, except all of aider's suggestions were always accepted without user approval.
- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*. Plausibly correct means that aider reported that it had successfully edited the repo without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems (a sketch of this retry flow appears below).

It's important to be clear that *aider and the benchmark harness only had access to the pre-existing tests in each problem's repo*. The held out "acceptance tests" were *only* used after benchmarking to compute statistics on which problems aider correctly resolved.

This is the same approach that was used for [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html). For the Lite benchmark, aider alternated between GPT-4o and Opus for up to six total attempts.
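To make the retry flow concrete, here is a minimal Python sketch of a harness along these lines. It is only an illustration, not the actual code from the aider-swe-bench repo; the `Solution` dataclass and `run_aider` helper are hypothetical stand-ins for the real harness logic.

```python
# Illustrative sketch of a plausibility-based retry harness.
# The real harness lives in the aider-swe-bench repository; the helpers
# below are hypothetical stand-ins, not aider's actual API.

from dataclasses import dataclass


@dataclass
class Solution:
    diff: str            # proposed edits to the repo
    edit_errors: int     # problems applying the edits
    lint_errors: int     # syntax/lint problems introduced by the edits
    test_failures: int   # pre-existing tests broken by the edits

    @property
    def problem_count(self) -> int:
        return self.edit_errors + self.lint_errors + self.test_failures

    @property
    def is_plausible(self) -> bool:
        # "Plausibly correct": aider edited the repo without syntax errors
        # or breaking any pre-existing tests.
        return self.problem_count == 0


def run_aider(problem_statement: str, model: str) -> Solution:
    """Hypothetical helper: launch aider non-interactively in the problem's
    git repo (roughly `aider --yes --model <model> --test-cmd pytest`),
    send the problem statement as the opening chat message, and collect
    any edit/lint/test problems that were reported."""
    raise NotImplementedError("stand-in for the real harness logic")


def solve(problem_statement: str, models=("gpt-4o", "claude-3-opus")) -> Solution:
    """Try each model in turn; return the first plausible solution,
    otherwise fall back to the "most plausible" attempt."""
    attempts = []
    for model in models:
        solution = run_aider(problem_statement, model=model)
        if solution.is_plausible:
            return solution
        attempts.append(solution)
    # No plausible solution: pick the attempt with the fewest
    # outstanding edit/lint/test problems.
    return min(attempts, key=lambda s: s.problem_count)
```

Ranking the non-plausible attempts by their outstanding problem count is what produces the "most plausible" fallback solution described above.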
To manage the cost of running the main SWE Bench benchmark, aider was limited to two total attempts: one with GPT-4o and one with Opus.

For a detailed discussion of the benchmark methodology, see the [article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html). Also, the [aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench) contains the harness and statistics code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to resolve a GitHub issue:

- They could launch aider in their repo with the command below, which tells aider they want to accept every suggestion and to use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue. Aider will pull in the URL's content and then try and resolve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to [use git to revert the changes](https://aider.chat/docs/git.html), and try again with `aider --opus`.

## Aider with GPT-4o alone was SOTA

Using aider with GPT-4o to make a single attempt at resolving each problem achieved a score of 17.0%. This was itself a state-of-the-art result, before being surpassed by the main result being reported here that used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

The benchmark harness started by using aider with GPT-4o to try and resolve each problem. For problems where this didn't produce a plausible solution, the harness tried again using aider with Opus. So at most, two attempts were made for each problem.

The table below breaks down the proposed solutions that were found from each attempt at the 570 problems. A proposed solution is either:

- A plausible solution where aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the [fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).

The table also provides details on the 108 solutions that were ultimately verified as correctly resolving their issue.

| Attempt | Agent |Number of