diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
new file mode 100644
index 000000000..7fec7475d
--- /dev/null
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -0,0 +1,212 @@
---
title: Aider is SOTA for both the main SWE Bench and SWE Bench Lite
excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
highlight_image: /assets/swe_bench.jpg
draft: true
---

# Aider is SOTA for both the main SWE Bench and SWE Bench Lite

Aider scored 18.8%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.

This is in addition to
[aider's SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html)
reported last week.

[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)

Aider was benchmarked on 570 of the 2294 SWE Bench problems.
These are the same [randomly selected 570 problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
that [Devin used in their evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
Please see the [references](#references)
for more details on the data presented in this chart.

## Interactive, not agentic

Aider achieved this result mainly through its existing features, which focus on static code analysis, reliable LLM code editing, and a pragmatic UX for AI pair programming.
Aider intentionally has quite limited and narrow "agentic behavior",
to avoid long delays, high token costs,
and the need for users to repeatedly review incorrect solutions.
Aider currently does not use RAG, vector search, or tools,
and it does not give the LLM access to search the web
or unilaterally execute code.

Aider is first and foremost an interactive tool for engineers to get real work done
in real code bases using a chat interface.
It provides a pair programming UX where users can ask for a change
and see the edits performed in real time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This lets them quickly steer misunderstandings back on course
and avoid wasting time and tokens.


## Benchmark methodology

For the benchmark,
aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that, aider ran as normal, with the following modifications:

- Aider's suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution wasn't plausible, the harness launched aider to try again from scratch,
this time using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems (this retry loop is sketched in code below).

It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
The held-out "acceptance tests" were *only* used
after benchmarking, to compute statistics on which problems aider
correctly resolved.
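The retry flow above is simple enough to sketch. The snippet below is a minimal illustration of that loop, not the actual harness code (which lives in the aider-swe-bench repository); `run_aider` and the error-count attributes on its return value are hypothetical stand-ins for launching aider on the repo and for the edit/lint/test checks.

```python
# Minimal sketch of the two-attempt retry loop described above.
# `run_aider(repo, problem, model)` and the attributes on its return value
# are hypothetical stand-ins; the real harness is in the aider-swe-bench repo.

MODELS = ["gpt-4o", "claude-3-opus"]  # attempt 1, then attempt 2

def solve(repo, problem, run_aider):
    attempts = []
    for model in MODELS:
        attempt = run_aider(repo, problem, model)  # returns a diff plus error counts
        attempts.append(attempt)
        # "Plausible" = aider reported a successful edit, no syntax/lint errors,
        # and no newly broken pre-existing tests.
        if attempt.edit_errors == 0 and attempt.lint_errors == 0 and attempt.test_errors == 0:
            return attempt
    # No plausible solution after two tries: fall back to the attempt
    # with the fewest outstanding edit/lint/test problems.
    return min(attempts, key=lambda a: (a.edit_errors, a.lint_errors, a.test_errors))
```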
This is the same methodology
that was used for [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
The only difference is that at most two attempts were made instead of six,
due to the increased token costs involved in this benchmark.
The SWE Bench problems are more difficult and involve edits to
more than one source file,
which increased the cost of solving each problem.
Further, aider was benchmarked on 570 SWE Bench problems,
versus only 300 Lite problems,
adding another factor of roughly two to the costs.

For a detailed discussion of the methodology, please see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
The [aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench) also contains
the harness and reporting code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

- They could launch aider in their repo with the command below, which
tells aider to accept every suggestion
and to use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try to solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, perhaps using aider with a different LLM.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to revert AI changes that don't pan out.
## Aider with GPT-4o alone was SOTA

Running the benchmark harness
with only a single attempt from aider with GPT-4o
achieved a score of 17.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result reported here,
which used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

The benchmark harness started by running aider with GPT-4o once to try
to solve the problem. If
no plausible solution was found, it then ran aider with Opus
once to try to solve the problem.

The table below breaks down the proposed solutions that
were found for the 570 problems.
A proposed solution is either:

- A plausible solution, where
aider reported no outstanding errors from editing, linting, and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting, or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).

The table also provides details on the 107 solutions that were ultimately
verified as correctly resolving their issue.

| Attempt | Agent | Number of<br>proposed<br>solutions | Percent of<br>proposed<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |

If we break down the solutions solely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair, direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw problems where GPT-4o had failed to
find a plausible solution on its first try.

Aider with GPT-4o also produced higher quality proposed solutions,
with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.

| Agent | Number of<br>proposed<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>proposed which<br>correctly resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 419 | 87 | 20.8% |
| Aider with Opus | 151 | 20 | 13.2% |
| **Total** | **570** | **107** | **18.8%** |
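As a quick sanity check, the percentages in the table above follow directly from the raw counts:

```python
# Recomputing the rates shown in the table above from the raw counts.
gpt4o_proposed, gpt4o_resolved = 419, 87
opus_proposed, opus_resolved = 151, 20

print(f"GPT-4o:  {gpt4o_resolved / gpt4o_proposed:.1%}")              # 20.8%
print(f"Opus:    {opus_resolved / opus_proposed:.1%}")                # 13.2%
print(f"Overall: {(gpt4o_resolved + opus_resolved) / 570:.1%}")       # 18.8%
```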
## Computing the benchmark score

After benchmarking,
a separate evaluation script was used to
test each of these solutions with the full test suite,
including the held-out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared the test results
with the results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the candidate solution was counted as correctly resolving the issue.
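The comparison step can be loosely sketched as below. This is only an approximation of the SWE Bench evaluation (which is run via the dockerized scripts mentioned in the acknowledgments); `run_tests` is a hypothetical helper that applies a patch to a clean checkout, runs the full suite, and returns per-test pass/fail results.

```python
# Loose sketch of the gold-patch comparison described above.
# `run_tests(repo, patch)` is a hypothetical helper returning {test_name: passed}.

def resolves_issue(repo, candidate_patch, gold_patch, run_tests):
    candidate = run_tests(repo, candidate_patch)
    gold = run_tests(repo, gold_patch)
    # The candidate is counted as resolving the issue if every test that the
    # human-written gold patch makes pass also passes with the candidate patch.
    return all(candidate.get(test, False) for test, passed in gold.items() if passed)
```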
| +|------------|---------:|---------:|---:| +| Aider with GPT-4o | 419 | 87 |20.8% | +| Aider with Opus | 151 | 20 |13.2% | +| **Total** | **570** | **107** |**18.8%** | + + +## Computing the benchmark score + +After benchmarking, +a separate evaluation script was used to +test each of these solutions with the full test suite, +including the held out acceptance tests. +For this final acceptance testing, any edits that aider made to tests +were discarded. +This ensured that the correct, +unmodified test suite is used for acceptance testing. +The evaluation script compared the test results +with results from testing +the "gold" patch that was developed by a human to correctly solve the issue. +If they matched, the candidate solution correctly resolved the issue. + +These acceptance tests were only ever run outside of aider +and the benchmark harness, and only to compute the number of +correctly resolved instances. +They were never run, used, or even visible during aider's attempts to solve the problems. + +Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked, +or 18.8%. + +## Acknowledgments + +Much thanks to the team behind the +[SWE Bench](https://www.swebench.com) +family of AI coding benchmarks. +Also thanks to Albert Örwall who has +[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker) +making it faster, easier, and more reliable to run the acceptance tests. + + +## References + +Below are the references for the SWE-Bench results +displayed in the graph at the beginning of this article. + +- [13.9% Devin (benchmarked on 570 instances)](https://www.cognition.ai/post/swe-bench-technical-report) +- [13.8% Amazon Q Developer Agent (benchmarked on 2294 instances)](https://www.swebench.com) +- [12.5% SWE- Agent + GPT-4 (benchmarked on 2294 instances)](https://www.swebench.com) +- [10.6% AutoCode Rover (benchmarked on 2294 instances)](https://arxiv.org/pdf/2404.05427v2) +- [10.5% SWE- Agent + Opus (benchmarked on 2294 instances)](https://www.swebench.com) + +The graph contains average pass@1 results for AutoCodeRover. +The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover) +features their pass@3 results +without being clearly labeled. +Table 2 of their +[paper](https://arxiv.org/pdf/2404.05427v2) +reports an `ACR-avg` result of 10.59% which is an average pass@1 result. + +The [official SWE Bench Lite leaderboard](https://www.swebench.com) +only accepts pass@1 results. 
diff --git a/assets/swe-bench.jpg b/assets/swe-bench.jpg
new file mode 100644
index 000000000..37c2769ce
Binary files /dev/null and b/assets/swe-bench.jpg differ
diff --git a/assets/swe-bench.svg b/assets/swe-bench.svg
new file mode 100644
index 000000000..3f5583403
--- /dev/null
+++ b/assets/swe-bench.svg
@@ -0,0 +1,2212 @@
[SWE Bench results chart: SVG generated by Matplotlib v3.9.0, 2024-05-31; vector markup omitted]
diff --git a/benchmark/swe-bench.txt b/benchmark/swe-bench.txt
new file mode 100644
index 000000000..7d5f34ee2
--- /dev/null
+++ b/benchmark/swe-bench.txt
@@ -0,0 +1,7 @@
18.8% Aider|GPT-4o|& Opus|(570)
17.0% Aider|GPT-4o|(570)
13.9% Devin|(570)
13.8% Amazon Q|Developer|Agent|(2294)
12.5% SWE-|Agent|+ GPT-4|(2294)
10.6% AutoCode|Rover|(2294)
10.5% SWE-|Agent|+ Opus|(2294)
diff --git a/benchmark/swe_bench_lite.py b/benchmark/swe_bench_lite.py
index 023f50d5f..64b2f044b 100644
--- a/benchmark/swe_bench_lite.py
+++ b/benchmark/swe_bench_lite.py
@@ -52,7 +52,7 @@ def plot_swe_bench_lite(data_file):
     for model, bar in zip(models, bars):
         yval = bar.get_height()
-        y = yval - 1.25
+        y = yval - 1
         va = "top"
         color = "#eee" if "Aider" in model else "#555"
         fontfamily = "Helvetica Bold" if "Aider" in model else "Helvetica"
@@ -76,7 +76,7 @@
     ax.set_title(title, fontsize=20)
     # ax.set_ylim(0, 29.9)
     plt.xticks(
-        fontsize=16,
+        fontsize=17,
         color=font_color,
     )
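For reference, the pipe characters in `benchmark/swe-bench.txt` appear to mark line breaks in the multi-line bar labels of the chart. A hedged sketch of how such a file could be loaded for plotting is below; this is an assumption about the format, not the actual parsing code, which lives in `benchmark/swe_bench_lite.py` and is only partially shown in this diff.

```python
# Hypothetical loader for the pipe-delimited results file shown above.
# Each line looks like: "18.8% Aider|GPT-4o|& Opus|(570)"
def load_results(path):
    models, scores = [], []
    for line in open(path):
        line = line.strip()
        if not line:
            continue
        pct, _, label = line.partition(" ")          # "18.8%", "Aider|GPT-4o|& Opus|(570)"
        scores.append(float(pct.rstrip("%")))        # 18.8
        models.append(label.replace("|", "\n"))      # pipes become label line breaks
    return models, scores
```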