added draft article

This commit is contained in:
Paul Gauthier 2024-05-31 09:33:31 -07:00
parent a168daf5fc
commit 0120d434ff
5 changed files with 2433 additions and 2 deletions


@@ -0,0 +1,212 @@
---
title: Aider is SOTA for both the main SWE Bench and SWE Bench Lite
excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
highlight_image: /assets/swe_bench.jpg
draft: true
---
# Aider is SOTA for both the main SWE Bench and SWE Bench Lite
Aider scored 18.8%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.
This is in addition to
[aider's SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html)
that was reported last week.
[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
Aider was benchmarked on 570 of the 2294 SWE Bench problems.
These are the same [randomly selected 570 problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs) that
[Devin used in their evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
Please see the [references](#references)
for more details on the data presented in this chart.
## Interactive, not agentic
Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs,
and the need for users to repeatedly code review incorrect solutions.
It's also worth noting that aider currently does not use
RAG, vector search, or tools,
nor does it give the LLM access to search the web or unilaterally execute code.
Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming UX where users can ask for a change
and see the edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This lets the user quickly steer misunderstandings back on course,
avoiding wasted time and token costs.
## Benchmark methodology
For the benchmark,
aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that, aider ran as normal, with the following modifications:
- Aider's suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution wasn't plausible, the harness launched aider to try again from scratch,
this time using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems, as sketched in the code below.
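
The retry-and-select logic above is small, and the sketch below shows the general shape of it in Python.
It is purely an illustration: the `run_aider` helper and the error-count fields are hypothetical stand-ins, not the actual harness code from the aider SWE Bench repository.

```python
# Illustrative sketch only; not the real benchmark harness.
def solve_instance(problem):
    attempts = []
    for model in ("gpt-4o", "claude-3-opus"):  # model names assumed for the sketch
        result = run_aider(problem, model=model)  # hypothetical: one full aider run
        attempts.append(result)

        # "Plausible" means aider reported a successful edit with no syntax
        # errors and no failures in the *pre-existing* test suite.
        if (result.edit_errors, result.lint_errors, result.test_failures) == (0, 0, 0):
            return result

    # No plausible solution after two tries: fall back to the attempt with
    # the fewest outstanding edit/lint/test problems ("most plausible").
    return min(attempts, key=lambda r: (r.edit_errors, r.lint_errors, r.test_failures))
```
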
It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.
This is the same methodology
that was used for [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
The only difference is that at most two tries were attempted instead of six,
due to the increased token costs involved in this benchmark.
The SWE Bench problems are more difficult and involve edits to
more than one source file,
which increased the cost of solving each problem.
Further, aider was benchmarked on 570 SWE Bench problems,
versus only 300 Lite problems,
adding another factor of ~two to the costs.
For a detailed discussion of the methodology, please see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
The [aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench) also contains
the harness and reporting code used for the benchmarks.
The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:
- They could launch aider in their repo with the command below, which
tells aider they want to accept every suggestion
and to use pytest to run tests.
- `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try to solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to revert AI changes that don't pan out.
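
For example, here is a minimal, non-interactive version of that same workflow scripted with Python's `subprocess` module.
It assumes aider is installed on the PATH and that the issue text is passed via aider's `--message` flag; it is an illustration of the workflow above, not the benchmark harness.

```python
import subprocess

# Run aider once on a cloned repo, feeding it the text of a GitHub issue.
issue_text = open("issue.md").read()  # placeholder file holding the issue text

subprocess.run(
    [
        "aider",
        "--yes",                  # accept every suggestion
        "--test-cmd", "pytest",   # run pytest after edits
        "--message", issue_text,  # send the issue as the opening chat message
    ],
    cwd="path/to/repo",  # placeholder path to the target repository
    check=False,
)
```

If the resulting edits don't lint and test clean, `git diff` and `git checkout` or `git revert` make it easy to inspect or discard them before trying again.
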
## Aider with GPT-4o alone was SOTA
Running the benchmark harness with only a single attempt per problem,
using aider with GPT-4o to find plausible solutions,
achieved a score of 17.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result reported here,
which used aider with both GPT-4o & Opus.
## Aider with GPT-4o & Opus
The benchmark harness started by running aider with GPT-4o once to try
to solve the problem. If
no plausible solution was found, it then ran aider with Opus
once to try to solve the problem.
The table below breaks down the proposed solutions that
were found for the 570 problems.
A proposed solution is either:
- A plausible solution where
aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
The table also provides details on the 107 solutions that were ultimately
verified as correctly resolving their issue.

| Attempt | Agent |Number&nbsp;of<br>proposed<br>solutions|Percent&nbsp;of<br>proposed<br>solutions| Number&nbsp;of<br>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
If we break down the solutions solely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw the problems where aider with GPT-4o failed to
find a plausible solution on its first try.
Aider with GPT-4o also produced higher quality proposed solutions,
which were more likely to go on to correctly resolve the issue.
Again, this comparison is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.

| Agent | Number&nbsp;of<br>proposed<br>solutions | Number&nbsp;of<br>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>proposed<br>which<br>correctly<br>resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 419 | 87 |20.8% |
| Aider with Opus | 151 | 20 |13.2% |
| **Total** | **570** | **107** |**18.8%** |
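
As a quick sanity check, the percentages in the two tables above follow directly from the raw counts:

```python
# Reproduce the table percentages from the raw solution counts.
proposed = {"Aider with GPT-4o": 419, "Aider with Opus": 151}
resolved = {"Aider with GPT-4o": 87, "Aider with Opus": 20}
total_problems = 570  # SWE Bench problems benchmarked

for agent in proposed:
    share = resolved[agent] / proposed[agent]  # e.g. 87/419 = 20.8%
    score = resolved[agent] / total_problems   # e.g. 87/570 = 15.3%
    print(f"{agent}: {share:.1%} of proposals resolved, contributing {score:.1%} to the score")

print(f"Overall score: {sum(resolved.values()) / total_problems:.1%}")  # 107/570 = 18.8%
```
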
## Computing the benchmark score
After benchmarking,
a separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared the test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they matched, the candidate solution correctly resolved the issue.
These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to solve the problems.
Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
or 18.8%.
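
Conceptually, the final check for each instance boils down to comparing two sets of test outcomes.
The sketch below illustrates that idea; the helper functions (`strip_test_edits`, `apply_patch`, `run_full_test_suite`) are hypothetical placeholders, not the actual SWE Bench evaluation scripts.

```python
# Conceptual sketch of acceptance testing for one SWE Bench instance.
# Helper functions are hypothetical placeholders for the evaluation scripts.
def is_correctly_resolved(repo, candidate_patch, gold_patch, aider_test_edits):
    # Discard any edits aider made to test files, so the original,
    # unmodified test suite is what actually gets run.
    candidate_patch = strip_test_edits(candidate_patch, aider_test_edits)

    # Run the full suite, including the held-out acceptance tests,
    # against both the candidate solution and the human-written "gold" patch.
    candidate_results = run_full_test_suite(apply_patch(repo, candidate_patch))
    gold_results = run_full_test_suite(apply_patch(repo, gold_patch))

    # The instance counts as resolved only if the candidate reproduces
    # the same test results as the gold patch.
    return candidate_results == gold_results
```
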
## Acknowledgments
Many thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall, who has
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker),
making it faster, easier, and more reliable to run the acceptance tests.
## References
Below are the references for the SWE-Bench results
displayed in the graph at the beginning of this article.
- [13.9% Devin (benchmarked on 570 instances)](https://www.cognition.ai/post/swe-bench-technical-report)
- [13.8% Amazon Q Developer Agent (benchmarked on 2294 instances)](https://www.swebench.com)
- [12.5% SWE-Agent + GPT-4 (benchmarked on 2294 instances)](https://www.swebench.com)
- [10.6% AutoCodeRover (benchmarked on 2294 instances)](https://arxiv.org/pdf/2404.05427v2)
- [10.5% SWE-Agent + Opus (benchmarked on 2294 instances)](https://www.swebench.com)
The graph contains average pass@1 results for AutoCodeRover.
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
features their pass@3 results,
which are not clearly labeled as such.
Table 2 of their
[paper](https://arxiv.org/pdf/2404.05427v2)
reports an `ACR-avg` result of 10.59%, which is an average pass@1 result.
The [official SWE Bench leaderboard](https://www.swebench.com)
only accepts pass@1 results.

assets/swe-bench.jpg: new binary file, 50 KiB (binary file not shown)
assets/swe-bench.svg: new file, 2212 lines (diff suppressed because it is too large)
benchmark/swe-bench.txt: new file, 7 lines

@@ -0,0 +1,7 @@
18.8% Aider|GPT-4o|& Opus|(570)
17.0% Aider|GPT-4o|(570)
13.9% Devin|(570)
13.8% Amazon Q|Developer|Agent|(2294)
12.5% SWE-|Agent|+ GPT-4|(2294)
10.6% AutoCode|Rover|(2294)
10.5% SWE-|Agent|+ Opus|(2294)


@@ -52,7 +52,7 @@ def plot_swe_bench_lite(data_file):
     for model, bar in zip(models, bars):
         yval = bar.get_height()
-        y = yval - 1.25
+        y = yval - 1
         va = "top"
         color = "#eee" if "Aider" in model else "#555"
         fontfamily = "Helvetica Bold" if "Aider" in model else "Helvetica"
@@ -76,7 +76,7 @@ def plot_swe_bench_lite(data_file):
     ax.set_title(title, fontsize=20)
     # ax.set_ylim(0, 29.9)
     plt.xticks(
-        fontsize=16,
+        fontsize=17,
         color=font_color,
     )