---
title: Aider scores SOTA 26.3% on SWE Bench Lite
excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
draft: true
---
# Aider scores SOTA 26.3% on SWE Bench Lite
[Aider scored 26.3%]()
on the
[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
The current top leaderboard entry is 20.33%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior":
it doesn't require a highly detailed upfront "spec" from the user,
use RAG or vector search, farm out sub-problems to an army of LLMs,
allow the LLM to use tools
or perform web searches,
etc.
Aider is first and foremost a tool for engineers to get real work done in
real code bases through a pair programming chat style interface.
In normal use, the user is in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasted time, code reviews and token costs.
When a user asks aider for a change, they see the edits performed in real-time.
Aider may also then offer additional
help like fixing lint or test errors.
For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
After that aider runs as normal, with the following modifications:
- Aider's suggestions were always accepted.
When chatting, aider will suggest which files in the repo may need to be edited based on
the conversation.
It will offer to lint code that has been edited,
and to fix any issues uncovered.
Aider has workflows to run the repo's test suite and resolve failing tests.
Normally the user is asked to approve such suggestions, but
they were always accepted during the benchmark.
- A simple harness was used to retry the SWE Bench problem if aider produced code which wasn't *plausibly correct*.
Plausible means that aider successfully edited the repo without breaking anything.
As mentioned, aider has integrated support for linting and testing,
so the harness just looks at aider's completion status to see if those
operations finished clean.
Note that *aider only had access to the pre-existing tests in the repo*,
not the held out "acceptance tests" that are used later to see if the
SWE Bench problem was correctly resolved.
- If the solution isn't plausible, the harness launches aider to try again from scratch.
The harness alternates between running aider with GPT-4o and Opus up to three times each,
until it finds a plausible solution.
- If no plausible solution is found, the harness picks the solution
with the least amount of edit/lint/test problems.
This is all roughly equivalent to a user:
- Launching aider in their repo with something like the command below, which
tells aider to say yes to every suggestion and use pytest to run tests.
- `aider --yes --test-cmd pytest`
- Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like:
- `/web https://github.com/django/django/issues/XXX`
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again. Maybe with a different LLM this time.
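To make the setup concrete, here is a minimal Python sketch of how a harness could drive aider this way. It is not the actual benchmark harness; it assumes aider's `--message` flag for passing the issue text non-interactively, and the repo path shown is hypothetical.

```
import subprocess

def run_aider_on_issue(repo_dir, issue_text):
    """Launch aider in a problem's git repo with the issue as the opening message.

    A sketch only: --yes and --test-cmd are the flags shown above, and
    --message (assumed here for scripted, non-interactive use) passes the
    SWE Bench problem statement as the first chat message.
    """
    result = subprocess.run(
        [
            "aider",
            "--yes",                  # accept every suggestion
            "--test-cmd", "pytest",   # let aider run the repo's test suite
            "--message", issue_text,  # the problem statement / GitHub issue text
        ],
        cwd=repo_dir,
    )
    return result.returncode

# Hypothetical usage:
# run_aider_on_issue("repos/some__project-1234", issue_text)
```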
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to undo/revert AI changes that don't pan out.
Of course, outside a benchmark setting it's probably
unwise to let *any* AI agent run unsupervised on your code base.
Aider is intended to be used as an interactive pair-programming chat,
where the user participates to direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.
## Aider with GPT-4o alone was SOTA
Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
achieved a score of 25%.
This was itself a state of the art result, before being surpassed by the main
result being reported here
that uses aider with both GPT-4o & Opus.
## GPT-4o vs Opus
The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found.
The table below breaks down the 79 solutions which were ultimately
verified as correctly resolving their task.
Some noteworthy observations:
- Aider with GPT-4o found 77% of the valid solutions on its first attempt.
- ~90% of the valid solutions were found after one attempt each from aider with GPT-4o and Opus.
- A long tail of solutions continued to be found by both models, including on the final, 6th attempt.
| Attempt | Model | Number<br/>resolved | Percent<br/>of resolved | Cumulative<br/>percent of<br/>resolved |
|:--------:|------------|---------:|---------:|----:|
| 1 | GPT-4o | 61 | 77.2 | 77.2
| 2 | Opus | 10 | 12.7 | 89.9
| 3 | GPT-4o | 3 | 3.8 | 93.7
| 4 | Opus | 2 | 2.5 | 96.2
| 5 | GPT-4o | 2 | 2.5 | 98.7
| 6 | Opus | 1 | 1.3 | 100.0
|**Total**| | **79** | **100%** | **100%** |
If we just look at which models produced correct solutions,
we can see that GPT-4o dominates.
This isn't a fair comparison, because GPT-4o always took the first
attempt at solving.
But anecdotal evidence from early runs of the benchmark
supports the observation that GPT-4o is significantly stronger than Opus
for this endeavor.
| Model | Number resolved | Percent of resolved |
|------------|---------:|---------:|
| GPT-4o | 66 | 83.5 |
| Opus | 13 | 16.5 |
|**Total**| **79** | **100%** |
## Repository map, not RAG
The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and arming the LLM with
tools to interactively explore the code base.
Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure and content of a git repo.
The repo map is created from the code's AST and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show repo context that is relevant to the current state of the chat conversation,
by performing a graph optimization on the code's call graph.
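As a rough illustration of the graph optimization idea, here is a toy Python sketch that ranks repo files with a personalized PageRank over definition/reference edges, boosting files already in the chat. It assumes `networkx` and is not aider's actual repo map implementation.

```
import networkx as nx  # assumed dependency for this sketch

def rank_repo_files(reference_edges, chat_files):
    """Rank repo files by relevance with a personalized PageRank.

    reference_edges: (referencing_file, defining_file) pairs extracted from
    the code's definition/reference graph (e.g. via tree-sitter).
    chat_files: files currently in the chat, given extra weight so the map
    is tailored to the conversation.

    Toy illustration only -- not aider's actual repo map code.
    """
    graph = nx.DiGraph()
    graph.add_edges_from(reference_edges)

    boost = {f: 1.0 for f in chat_files if f in graph}
    ranks = nx.pagerank(graph, personalization=boost or None)
    return sorted(ranks, key=ranks.get, reverse=True)

# Hypothetical usage:
# edges = [("views.py", "models.py"), ("tests/test_views.py", "views.py")]
# print(rank_repo_files(edges, chat_files=["views.py"]))
```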
When the user asks for a change to their code, the LLM uses the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
<div class="chat-transcript" markdown="1">
#### Please add a new /factorial/N endpoint.
To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
Please add app.py to the chat so I can proceed with the changes.
> app.py
> Add these files to the chat? yes
</div>
This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench tasks.
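The filename-noticing step can be sketched in a few lines of Python. This is a simplified illustration, not aider's real matching logic, which is more careful about partial paths and quoting.

```
from pathlib import Path

def mentioned_repo_files(llm_reply, repo_root):
    """Return repo files whose names or paths appear in an LLM reply.

    Simplified illustration of the "notice filenames, offer to add them"
    step; aider's real matching is more robust than this substring check.
    """
    repo = Path(repo_root)
    tracked = [
        p.relative_to(repo).as_posix()
        for p in repo.rglob("*")
        if p.is_file() and ".git" not in p.parts
    ]
    return [rel for rel in tracked
            if rel in llm_reply or Path(rel).name in llm_reply]

# Hypothetical usage:
# reply = "The most likely file that needs to be edited is app.py."
# print(mentioned_repo_files(reply, "."))
```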
Each task comes with a “gold” patch, which was created by a human developer
to solve the issue.
Aider successfully identified and added the file from the gold patch
in 70.3% of the benchmark tasks.
Of course aider is not able to see or use the gold patch
or the files it names in any way.
They were only used to compute this statistic after the benchmarking was completed.
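For reference, a statistic like this can be computed after the fact by comparing the files touched by each gold patch with the files aider added to the chat. The sketch below is illustrative only; the `results` structure and its field names are hypothetical.

```
def gold_file_hit_rate(results):
    """Fraction of tasks where aider added every file edited by the gold patch.

    `results` is a hypothetical list of dicts with two keys:
      "gold_files" - files touched by the human-written gold patch
      "chat_files" - files aider added to the chat while working on the task
    Computed only after benchmarking; aider never saw the gold patches.
    """
    hits = sum(
        1 for r in results if set(r["gold_files"]) <= set(r["chat_files"])
    )
    return hits / len(results)

# For the run described above this would come out to roughly 0.703.
```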
## Reliable code editing
Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.
Aider has always had a deep focus on ensuring that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that the LLM can not only code up a solution but
also properly integrate it into the existing code base and source files.
The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
This helps ensure that the project's existing APIs and conventions are
respected when new code is added.
## Linting and fixing
[Aider lints code]()
after every LLM edit, and offers to automatically fix
any linting errors.
Aider includes basic linters built with tree-sitter that support
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
These built-in linters will detect syntax errors and other fatal problems with the code.
Users can also configure aider to use their preferred linters.
This allows aider to check for a larger class of problems, keep the code style
aligned with the rest of the repo, etc.
But for the benchmark, aider simply used its built-in linters.
Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree (AST) to display relevant code context for each
error.
This context increases the ability of the LLM to understand the problem and
make the correct changes to resolve it.
<div class="chat-transcript" markdown="1">
> app.py:23:36: F821 undefined name 'num'
> app.py:41:16: F541 f-string is missing placeholders
>
> app.py:
> ...⋮...
> 6│class LongNum:
> 7│ def __init__(self, num):
> 8│ """
> 9│ Initialize the number.
> 10│ """
> ...⋮...
> 19│ def __str__(self):
> 20│ """
> 21│ Render the number as a string.
> 22│ """
> 23█ return str(num)
> 24│
> 25│
> 26│@app.route('/subtract/<int:x>/<int:y>')
> ...⋮...
> 38│@app.route('/divide/<int:x>/<int:y>')
> 39│def divide(x, y):
> 40│ if y == 0:
> 41█ return f"Error: Cannot divide by zero"
> 42│ else:
> 43│ result = x / y
> 44│ return str(result)
> 45│
> ...⋮...
>
> Attempt to fix lint errors? yes
</div>
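The shape of this lint-and-fix feedback loop can be sketched in Python. In the sketch, `flake8` and a fixed window of surrounding lines stand in for aider's built-in tree-sitter linters and AST-based context; it is an illustration, not aider's implementation.

```
import subprocess

def lint_with_context(filename, context=3):
    """Run a linter and attach nearby source lines to each reported error.

    Simplified sketch: flake8 stands in for aider's built-in tree-sitter
    linters, and a fixed +/- `context` line window stands in for the
    AST-based context that aider actually shows the LLM.
    """
    proc = subprocess.run(["flake8", filename], capture_output=True, text=True)
    if proc.returncode == 0:
        return ""  # nothing to fix

    source = open(filename).read().splitlines()
    report = []
    for line in proc.stdout.splitlines():
        # flake8 lines look like: app.py:23:36: F821 undefined name 'num'
        lineno = int(line.split(":")[1])
        lo = max(0, lineno - 1 - context)
        hi = min(len(source), lineno + context)
        report.append(line)
        report.extend(f"{i + 1:4}| {source[i]}" for i in range(lo, hi))
        report.append("")
    return "\n".join(report)

# The report is what would be sent back to the LLM along with a request
# to fix the problems, e.g. print(lint_with_context("app.py")).
```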
## Testing and fixing
Aider can be configured with the command needed to run tests for a repo.
A user working on a python project might do that by launching
aider like this:
```
aider --test-cmd pytest
```
The repositories that are used in the SWE Bench problems are large open
source projects with extensive existing test suites.
A repo's test suite can be run in three ways:
1. Run tests as they existed before trying to solve the problem, without any changes.
2. Run tests after aider has modified the repo.
So the pre-existing test cases are still present, but may have been modified by aider.
Aider may have also added new tests.
3. Run the final "acceptance tests" to judge if the coding agent has
successfully resolved the problem.
SWE Bench verifies both pre-existing tests and a set of held out acceptance tests
(from the so-called `test_patch`)
to check that the issue is properly resolved. During this final acceptance testing,
any aider edits to tests are discarded to ensure a faithful test of whether the
issue was resolved.
For the benchmark, aider is configured with a test command that will run the tests
as described in (2) above.
So testing will fail if aider has broken any pre-existing tests or if any new
tests that it created aren't passing.
When aider runs a test command, it checks for a non-zero exit status.
In this case,
aider will automatically
share the test output with the LLM and ask it to
try and resolve the test failures.
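A minimal Python sketch of that test loop, assuming the repo's tests can be run with a single shell command like `pytest`; the function and its return convention are illustrative, not aider's actual code:

```
import subprocess

def run_tests_and_report(test_cmd="pytest", repo_dir="."):
    """Run the configured test command; return its output only if tests failed.

    Minimal sketch of the loop described above: a non-zero exit status means
    the output should be shared with the LLM with a request to resolve the
    failures, a zero exit status means there is nothing to report.
    """
    proc = subprocess.run(
        test_cmd.split(), cwd=repo_dir, capture_output=True, text=True
    )
    if proc.returncode == 0:
        return None
    return proc.stdout + proc.stderr

# failures = run_tests_and_report("pytest")
# if failures:
#     ...share `failures` with the LLM and ask it to fix the failing tests...
```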
To be clear, *aider cannot run or even see the "acceptance tests"* from the `test_patch`
as described in (3).
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.
## Finding a plausible solution
As aider executes, it notes the outcome of the editing, linting and testing
steps.
When aider completes, it reports its final status as either:
succeeded with no errors remaining,
or ended without resolving all errors.
The benchmark harness uses these outcomes to determine if it has a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
reports that it edited the repo with no outstanding
edit, lint or test errors.
In this case, aider's changes are taken as the proposed solution and recorded
as the SWE Bench `model_patch` to be evaluated later with the
`test_patch` "acceptance tests".
If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
The harness alternates asking GPT-4o and Opus to solve the problem,
and gives each model three attempts -- for a total of six attempts.
As soon as a plausible solution is found, it is accepted and the
harness moves on to the next SWE Bench instance.
It's worth noting that repositories may have lint or test errors present before aider even starts to edit them. Whether errors are caused by aider or were pre-existing, there will be instances where, after six tries, no plausible solution is obtained.
If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
`model_patch`.
Which of the non-plausible solutions to use is determined
by ignoring the testing outcome
and prioritizing solutions in the following order:
- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.
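Putting the retry loop and the fallback ordering together, the harness logic is roughly as sketched below. The helper `attempt_solution` and the outcome fields are hypothetical stand-ins; only the alternation, the plausibility check and the priority order come from the description above.

```
MODELS = ["gpt-4o", "opus"]  # alternate, always starting with GPT-4o

def solve_instance(attempt_solution, max_attempts=6):
    """Retry loop and fallback selection, as described above (sketch only)."""
    attempts = []
    for i in range(max_attempts):
        outcome = attempt_solution(MODELS[i % 2])  # fresh aider run from scratch
        attempts.append(outcome)
        plausible = (
            outcome["edited"]
            and not outcome["lint_errors"]
            and not outcome["test_errors"]
        )
        if plausible:
            return outcome["diff"]  # accept immediately as the model_patch

    # No plausible solution: ignore test results and prefer, in order,
    # clean edits with clean lint, partial edits with clean lint,
    # clean edits, then partial edits.
    def priority(o):
        return (
            o["edited"] and not o["lint_errors"],
            o["partially_edited"] and not o["lint_errors"],
            o["edited"],
            o["partially_edited"],
        )

    return max(attempts, key=priority)["diff"]
```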
## Computing the benchmark score
The benchmark harness produces one "best" solution for each of the 300
SWE Bench Lite instances, and saves it as a `model_patch`.
A separate evaluation script uses the SWE Bench support code to
test each of these results with the acceptance tests.
These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used or even visible during the attempts to solve the problems.
Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
## Acknowledgments
Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](SWE-bench-docker)
making it faster, easier and more reliable to run the acceptance tests.