From db30e5da2b67404a58993469beadcff891b4a0bb Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Wed, 22 May 2024 15:03:11 -0700
Subject: [PATCH] Added draft post

---
 _posts/2024-05-22-swe-bench-lite.md | 364 ++++++++++++++++++++++++++++
 1 file changed, 364 insertions(+)
 create mode 100644 _posts/2024-05-22-swe-bench-lite.md

diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
new file mode 100644
index 000000000..4021415f4
--- /dev/null
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -0,0 +1,364 @@
---
title: Aider scores SOTA 26.3% on SWE Bench Lite
excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
draft: true
---

# Aider scores SOTA 26.3% on SWE Bench Lite

[Aider scored 26.3%]()
on the
[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
The current top leaderboard entry is 20.33%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).

Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior":
it doesn't require a highly detailed upfront "spec" from the user,
use RAG or vector search, farm out sub-problems to an army of LLMs,
allow the LLM to use tools,
or perform web searches,
etc.

Aider is first and foremost a tool for engineers to get real work done in
real code bases through a pair programming chat style interface.
In normal use, the user is in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasted time, code reviews and token costs.
When a user asks aider for a change, they see the edits performed in real-time.
Aider may also then offer additional
help, like fixing lint or test errors.

For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
After that, aider ran as normal, with the following modifications:

- Aider's suggestions were always accepted.
When chatting, aider will suggest which files in the repo may need to be edited based on
the conversation.
It will offer to lint code that has been edited,
and to fix any issues uncovered.
Aider has workflows to run the repo's test suite and resolve failing tests.
Normally the user is asked to approve such suggestions, but
they were always accepted during the benchmark.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausible means that aider successfully edited the repo without breaking anything.
As mentioned, aider has integrated support for linting and testing,
so the harness just looks at aider's completion status to see if those
operations finished clean.
Note that *aider only had access to the pre-existing tests in the repo*,
not the held out "acceptance tests" that are used later to see if the
SWE Bench problem was correctly resolved.
- If the solution isn't plausible, the harness launches aider to try again from scratch.
The harness alternates between running aider with GPT-4o and Opus, up to three times each,
until it finds a plausible solution.
- If no plausible solution is found, the harness picks the solution
with the fewest edit/lint/test problems.
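In outline, the harness amounts to something like the sketch below. This is an illustration only, not the actual benchmark code; the `Attempt` record and the `run_aider()` helper are hypothetical stand-ins for launching aider and reading back its completion status.

```python
# Rough sketch of the retry harness described above (illustration only).
# run_aider() is a hypothetical stand-in for launching aider on the task's
# repo with a given model and collecting its completion status.
from dataclasses import dataclass

@dataclass
class Attempt:
    diff: str          # the git diff aider produced
    edit_ok: bool      # aider applied its edits cleanly
    lint_ok: bool      # the built-in linters report no remaining errors
    tests_ok: bool     # the repo's pre-existing test suite passes
    num_problems: int  # count of outstanding edit/lint/test problems

def run_aider(task: str, model: str) -> Attempt:
    """Hypothetical helper: run aider on one SWE Bench task with one model."""
    raise NotImplementedError

def solve(task: str) -> str:
    """Alternate GPT-4o and Opus, three tries each, until a fix is plausible."""
    models = ["gpt-4o", "claude-3-opus"]
    attempts = []
    for i in range(6):
        attempt = run_aider(task, models[i % 2])  # GPT-4o first, then Opus, ...
        if attempt.edit_ok and attempt.lint_ok and attempt.tests_ok:
            return attempt.diff                   # plausible: accept immediately
        attempts.append(attempt)
    # No plausible solution after six tries: fall back to the attempt
    # with the fewest edit/lint/test problems.
    return min(attempts, key=lambda a: a.num_problems).diff
```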
This is all roughly equivalent to a user:

- Launching aider in their repo with something like the command below, which
tells aider to say yes to every suggestion and use pytest to run tests.
  - `aider --yes --test-cmd pytest`
- Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like:
  - `/web https://github.com/django/django/issues/XXX`
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again. Maybe with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to undo/revert AI changes that don't pan out.

Of course, outside a benchmark setting it's probably
unwise to let *any* AI agent run unsupervised on your code base.
Aider is intended to be used as an interactive pair-programming chat,
where the user participates to direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.

## Aider with GPT-4o alone was SOTA

Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
achieved a score of 25%.
This was itself a state of the art result, before being surpassed by the main
result being reported here,
which uses aider with both GPT-4o & Opus.

## GPT-4o vs Opus

The benchmark harness alternated between running aider with GPT-4o and Opus,
proceeding in a fixed order: it always started with GPT-4o and
then alternated with Opus until a plausible solution was found.

The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their task.
Some noteworthy observations:

- Aider with GPT-4o found 77% of the valid solutions on the first attempt.
- ~90% of the valid solutions were found after one attempt each from aider with GPT-4o and with Opus.
- A long tail of solutions continued to be found by both models, including on the final, sixth attempt.


| Attempt | Model | Number<br>resolved | Percent<br>of resolved | Cumulative<br>percent of<br>resolved |
|:--------:|------------|---------:|---------:|----:|
| 1 | GPT-4o | 61 | 77.2 | 77.2 |
| 2 | Opus | 10 | 12.7 | 89.9 |
| 3 | GPT-4o | 3 | 3.8 | 93.7 |
| 4 | Opus | 2 | 2.5 | 96.2 |
| 5 | GPT-4o | 2 | 2.5 | 98.7 |
| 6 | Opus | 1 | 1.3 | 100.0 |
| **Total** | | **79** | **100%** | **100%** |

If we just look at which models produced correct solutions,
we can see that GPT-4o dominates.
This isn't a fair comparison, because GPT-4o always took the first
attempt at solving.
But anecdotal evidence from early runs of the benchmark
supports the observation that GPT-4o is significantly stronger than Opus
for this endeavor.

| Model | Number resolved | Percent of resolved |
|------------|---------:|---------:|
| GPT-4o | 66 | 83.5 |
| Opus | 13 | 16.5 |
| **Total** | **79** | **100%** |


## Repository map, not RAG

The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and arming the LLM with
tools to interactively explore the code base.

Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure and content of a git repo.
The repo map is created from the code's AST and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation,
by performing a graph optimization on the code's call graph.

When the user asks for a change to their code, the LLM uses the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
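To make this concrete, a repo map excerpt for a small Flask app might look roughly like the snippet below. The exact rendering aider produces differs, and this excerpt is only illustrative (it reuses the `app.py` example that appears later in this post); the point is that the LLM sees file names plus the signatures of the most relevant classes, functions and routes, rather than full file bodies.

```
app.py:
⋮
│class LongNum:
│    def __init__(self, num):
│    def __str__(self):
⋮
│@app.route('/divide/<int:x>/<int:y>')
│def divide(x, y):
⋮
```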
#### Please add a new /factorial/N endpoint.

To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
Please add app.py to the chat so I can proceed with the changes.

> app.py
> Add these files to the chat? yes
This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench tasks.
Each task comes with a “gold” patch, which was created by a human developer
to solve the issue.
Aider successfully identified and added the file from the gold patch
in 70.3% of the benchmark tasks.

Of course, aider is not able to see or use the gold patch
or the files it names in any way.
They were only used to compute this statistic after the benchmarking was completed.


## Reliable code editing

Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.

Aider has always had a deep focus on ensuring that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that the LLM can not only code up a solution but
also properly integrate it into the existing code base and source files.

The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
This helps ensure that the project's existing APIs and conventions are
respected when new code is added.

## Linting and fixing

[Aider lints code]()
after every LLM edit, and offers to automatically fix
any linting errors.
Aider includes basic linters built with tree-sitter that support
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
These built-in linters will detect syntax errors and other fatal problems with the code.

Users can also configure aider to use their preferred linters.
This allows aider to check for a larger class of problems, keep the code style
aligned with the rest of the repo, etc.
But for the benchmark, aider simply used its built-in linters.

Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree (AST) to display relevant code context for each
error.
This context increases the ability of the LLM to understand the problem and
make the correct changes to resolve it.
> app.py:23:36: F821 undefined name 'num'
> app.py:41:16: F541 f-string is missing placeholders
>
> app.py:
> ...⋮...
> 6│class LongNum:
> 7│    def __init__(self, num):
> 8│        """
> 9│        Initialize the number.
> 10│        """
> ...⋮...
> 19│    def __str__(self):
> 20│        """
> 21│        Render the number as a string.
> 22│        """
> 23█        return str(num)
> 24│
> 25│
> 26│@app.route('/subtract/<int:x>/<int:y>')
> ...⋮...
> 38│@app.route('/divide/<int:x>/<int:y>')
> 39│def divide(x, y):
> 40│    if y == 0:
> 41█        return f"Error: Cannot divide by zero"
> 42│    else:
> 43│        result = x / y
> 44│        return str(result)
> 45│
> ...⋮...
>
> Attempt to fix lint errors? yes
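Aider's real implementation of this is built on tree-sitter and works across languages. Purely to illustrate the idea of attaching AST-derived context to a lint error, here is a minimal Python-only sketch using the standard `ast` module; the function names are hypothetical and this is not aider's actual code.

```python
# Illustration only: locate the function that encloses a lint error line,
# so the error can be shown together with its surrounding code context.
import ast

def enclosing_function(source: str, error_line: int):
    """Return (start, end) line numbers of the function containing error_line.

    If the error is not inside a function, fall back to just the error line.
    For nested functions this returns the outermost enclosing definition.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= error_line <= node.end_lineno:
                return node.lineno, node.end_lineno
    return error_line, error_line

def context_snippet(source: str, error_line: int) -> str:
    """Render the enclosing function with line numbers, marking the error line."""
    start, end = enclosing_function(source, error_line)
    lines = source.splitlines()
    return "\n".join(
        f"{n:4d}{'█' if n == error_line else '│'} {lines[n - 1]}"
        for n in range(start, end + 1)
    )
```

The transcript above shows the same idea in practice: the failing lines are marked and presented inside their enclosing class and function, with unrelated code elided.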
## Testing and fixing

Aider can be configured with the command needed to run tests for a repo.
A user working on a Python project might do that by launching
aider like this:

```
aider --test-cmd pytest
```

The repositories that are used in the SWE Bench problems are large open
source projects with extensive existing test suites.
A repo's test suite can be run in three ways:

1. Run tests as they existed before trying to solve the problem, without any changes.
2. Run tests after aider has modified the repo.
So the pre-existing test cases are still present, but may have been modified by aider.
Aider may have also added new tests.
3. Run the final "acceptance tests" to judge if the coding agent has
successfully resolved the problem.
SWE Bench verifies both pre-existing tests and a set of held out acceptance tests
(from the so-called `test_patch`)
to check that the issue is properly resolved. During this final acceptance testing,
any aider edits to tests are discarded to ensure a faithful test of whether the
issue was resolved.

For the benchmark, aider is configured with a test command that will run the tests
as described in (2) above.
So testing will fail if aider has broken any pre-existing tests or if any new
tests that it created aren't passing.
When aider runs a test command, it checks the exit status.
If the tests fail (a non-zero exit status),
aider will automatically
share the test output with the LLM and ask it to
try to resolve the test failures.

To be clear, *aider cannot run or even see the "acceptance tests"* from the `test_patch`
as described in (3).
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.



## Finding a plausible solution

As aider executes, it notes the outcome of the editing, linting and testing
steps.
When aider completes, it reports its final status as either:
succeeded with no errors remaining,
or ended without resolving all errors.

The benchmark harness uses these outcomes to determine if it has a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint or test errors.
In this case, aider's changes are taken as the proposed solution and recorded
as the SWE Bench `model_patch` to be evaluated later with the
`test_patch` "acceptance tests".

If the solution is not plausible, another
instance of aider is launched from scratch on the same problem.
The harness alternates asking GPT-4o and Opus to solve the problem,
and gives each model three attempts, for a total of six attempts.
As soon as a plausible solution is found, it is accepted and the
harness moves on to the next SWE Bench instance.

It's worth noting that repositories may have lint or test errors present
before aider even starts to edit them.
Whether errors are caused by aider or were pre-existing, there will be instances
where, after six tries, no plausible solution is obtained.

If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
`model_patch`.
Which of the non-plausible solutions to use is determined
by ignoring the testing outcome
and prioritizing solutions in the following order:

- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.

## Computing the benchmark score

The benchmark harness produces one "best" solution for each of the 300
SWE Bench Lite instances, and saves it as a `model_patch`.
A separate evaluation script uses the SWE Bench support code to
test each of these results with the acceptance tests.

These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used or even visible during the attempts to solve the problems.

Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.

## Acknowledgments

Many thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall, who has
[dockerized the SWE Bench evaluation scripts](SWE-bench-docker),
making it faster, easier and more reliable to run the acceptance tests.