diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
new file mode 100644
index 000000000..4021415f4
--- /dev/null
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -0,0 +1,364 @@
+---
+title: Aider scores SOTA 26.3% on SWE Bench Lite
+excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
+draft: true
+---
+
+# Aider scores SOTA 26.3% on SWE Bench Lite
+
+[Aider scored 26.3%]()
+on the
+[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
+The current top leaderboard entry is 20.33%
+from Amazon Q Developer Agent.
+The best result reported elsewhere online seems to be
+[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
+
+Aider achieved this result mainly through its focus on static code analysis,
+reliable LLM code editing
+and pragmatic workflows for interactive pair programming with AI.
+Aider intentionally has quite limited and narrow "agentic behavior":
+it doesn't require a highly detailed upfront "spec" from the user,
+use RAG or vector search, farm out sub-problems to an army of LLMs,
+allow the LLM to use tools
+or perform web searches,
+etc.
+
+Aider is first and foremost a tool for engineers to get real work done in
+real code bases through a pair programming chat style interface.
+In normal use, the user is in full interactive control.
+This lets them quickly steer misunderstandings back on course and
+avoid wasted time, code reviews and token costs.
+When a user asks aider for a change, they see the edits performed in real-time.
+Aider may also then offer additional
+help like fixing lint or test errors.
+
+For the benchmark,
+aider was launched in each problem's git repository
+with the problem statement
+submitted as the opening chat message from "the user".
+After that aider runs as normal, with the following modifications:
+
+- Aider's suggestions were always accepted.
+When chatting, aider will suggest which files in the repo may need to be edited based on
+the conversation.
+It will offer to lint code that has been edited,
+and to fix any issues uncovered.
+Aider has workflows to run the repo's test suite and resolve failing tests.
+Normally the user is asked to approve such suggestions, but
+they were always accepted during the benchmark.
+- A simple harness was used to retry the SWE Bench problem if aider produced code which wasn't *plausibly correct*.
+Plausible means that aider successfully edited the repo without breaking anything.
+As mentioned, aider has integrated support for linting and testing,
+so the harness just looks at aider's completion status to see if those
+operations finished clean.
+Note that *aider only had access to the pre-existing tests in the repo*,
+not the held out "acceptance tests" that are used later to see if the
+SWE Bench problem was correctly resolved.
+- If the solution isn't plausible, the harness launches aider to try again from scratch.
+The harness alternates between running aider with GPT-4o and Opus up to three times each,
+until it finds a plausible solution.
+- If no plausible solution is found, the harness picks the solution
+with the least amount of edit/lint/test problems.
+
+This is all roughly equivalent to a user:
+
+- Launching aider in their repo with something like the command below, which
+tells aider to say yes to every suggestion and use pytest to run tests.
+ - `aider --yes --test-cmd pytest`
+- Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like:
+ - `/web https://github.com/django/django/issues/XXX`
+- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again. Maybe with a different LLM this time.
+[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
+so it's always easy to undo/revert AI changes that don't pan out.
+
+Of course, outside a benchmark setting it's probably
+unwise to let *any* AI agent run unsupervised on your code base.
+Aider is intended to be used as an interactive pair-programming chat,
+where the user participates to direct aider's work and approve suggestions.
+This way the user can offer immediate feedback or corrections if their initial
+instructions turn out to be ambiguous,
+or if the AI starts going down a wrong path.
+
+## Aider with GPT-4o alone was SOTA
+
+Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
+achieved a score of 25%.
+This was itself a state of the art result, before being surpassed by the main
+result being reported here
+that uses aider with both GPT-4o & Opus.
+
+## GPT-4o vs Opus
+
+The benchmark harness alternated between running aider with GPT-4o and Opus.
+The harness proceeded in a fixed order, always starting with GPT-4o and
+then alternating with Opus until a plausible solution was found.
+
+The table below breaks down the 79 solutions which were ultimately
+verified as correctly resolving their task.
+Some noteworthy observations:
+
+- Aider with GPT-4o found 77% of the valid solutions on the first attempt.
+- ~90% of valid solutions were found after one attempt each from aider with GPT-4o and Opus.
+- A long tail of solutions continued to be found by both models, including on the final, sixth attempt.
+
+
+| Attempt | Model | Number resolved | Percent of resolved | Cumulative percent of resolved |
+|:--------:|------------|---------:|---------:|----:|
+| 1 | GPT-4o | 61 | 77.2 | 77.2 |
+| 2 | Opus | 10 | 12.7 | 89.9 |
+| 3 | GPT-4o | 3 | 3.8 | 93.7 |
+| 4 | Opus | 2 | 2.5 | 96.2 |
+| 5 | GPT-4o | 2 | 2.5 | 98.7 |
+| 6 | Opus | 1 | 1.3 | 100.0 |
+|**Total**| | **79** | **100%** | **100%** |
+
+If we just look at which models produced correct solutions,
+we can see that GPT-4o dominates.
+This isn't a fair comparison, because GPT-4o always took the first
+attempt at solving.
+But anecdotal evidence from early runs of the benchmark
+supports the observation that GPT-4o is significantly stronger than Opus
+for this endeavor.
+
+| Model | Number resolved | Percent of resolved |
+|------------|---------:|---------:|
+| GPT-4o | 66 | 83.5 |
+| Opus | 13 | 16.5 |
+|**Total**| **79** | **100%** |
+
+
+## Repository map, not RAG
+
+The crucial first step in solving a SWE Bench problem is figuring out
+which parts of the repo are relevant and which files need to be edited.
+Most coding agents use some combination of RAG, vector search
+and arming the LLM with
+tools to interactively explore the code base.
+
+Aider instead uses a
+[repository map](https://aider.chat/2023/10/22/repomap.html)
+to help the LLM understand the
+layout, code structure and content of a git repo.
+The repo map is created from the code's AST and call graph
+to provide a compact and powerful summary of the entire code base.
+The map is constantly
+tailored to show the repo context that is relevant
+to the current state of the chat conversation,
+by performing a graph optimization on the code's call graph.
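+
+The exact graph optimization is beyond the scope of this post, but the core idea
+can be sketched in a few lines.
+The sketch below is only an illustration, not aider's implementation:
+it assumes a file-level dependency graph has already been extracted from the AST,
+and it uses networkx's personalized PageRank to rank files by their relevance to
+the files already in the chat.
+
+```
+import networkx as nx  # assumption: networkx is installed
+
+def rank_repo_files(references, chat_files):
+    """
+    Toy illustration of ranking repo files for a repo map.
+
+    references: iterable of (defining_file, referencing_file) edges,
+                e.g. extracted from the AST / call graph.
+    chat_files: files currently in the chat, which get extra weight.
+    """
+    graph = nx.DiGraph()
+    for definer, referencer in references:
+        # Edge from the file that references a symbol to the file defining it.
+        graph.add_edge(referencer, definer)
+
+    # Bias the ranking toward files already in the chat conversation.
+    personalization = {f: 1.0 for f in chat_files if f in graph}
+
+    ranks = nx.pagerank(graph, personalization=personalization or None)
+    return sorted(ranks, key=ranks.get, reverse=True)
+
+# Example: app.py references code defined in utils.py and models.py.
+edges = [("utils.py", "app.py"), ("models.py", "app.py"), ("models.py", "utils.py")]
+print(rank_repo_files(edges, chat_files=["app.py"]))
+```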
+
+When the user asks for a change to their code, the LLM uses the repo map
+to decide which files to edit.
+The LLM simply returns a normal text response explaining which files
+it needs to edit and why.
+Aider notices when the LLM mentions filenames from the repo,
+and asks the user if they should be added to the chat.
+Adding a file to the chat allows the LLM to see the full contents
+of the file and edit it.
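+
+A minimal sketch of that filename-detection step might look like the following.
+It just scans the LLM's reply for words that match known repo file paths;
+aider's real logic is more careful, but the idea is the same.
+The chat transcript below then shows the resulting workflow.
+
+```
+def mentioned_files(llm_reply, repo_files):
+    """Return repo files whose path or basename appears in the LLM's reply."""
+    words = {word.strip(".,:;`'\"()") for word in llm_reply.split()}
+    mentioned = []
+    for path in repo_files:
+        basename = path.rsplit("/", 1)[-1]
+        if path in words or basename in words:
+            mentioned.append(path)
+    return mentioned
+
+reply = "The most likely file that needs to be edited is app.py."
+print(mentioned_files(reply, ["app.py", "tests/test_app.py", "README.md"]))
+# -> ['app.py']
+```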
+
+
+
+#### Please add a new /factorial/N endpoint.
+
+To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
+Please add app.py to the chat so I can proceed with the changes.
+
+> app.py
+> Add these files to the chat? yes
+
+
+
+This is a convenient and natural workflow for interactive chat,
+and it worked well for the SWE Bench tasks.
+Each task comes with a “gold” patch, which was created by a human developer
+to solve the issue.
+Aider successfully identified and added the file from the gold patch
+in 70.3% of the benchmark tasks.
+
+Of course, aider is not able to see or use the gold patch
+or the files it names in any way.
+They were only used to compute this statistic after the benchmarking was completed.
+
+
+## Reliable code editing
+
+Once files have been selected for editing,
+the next step is of course to edit the source code to fix the problem.
+
+Aider has always had a deep focus on ensuring that LLMs can not just write code,
+but reliably *edit* code.
+Aider has a collection of prompting strategies and code editing backends which have
+been honed through
+[extensive benchmarking](https://aider.chat/docs/leaderboards/).
+These foundational capabilities help ensure that the LLM can not only code up a solution but
+also properly integrate it into the existing code base and source files.
+
+The repository map helps here too, making sure that the LLM
+can see relevant classes, functions and variables from the entire repo.
+This helps ensure that the project's existing APIs and conventions are
+respected when new code is added.
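+
+As a rough illustration of what a code editing backend has to handle,
+the sketch below applies a single search/replace style edit to a file,
+rejecting the edit outright if the search text doesn't match exactly once.
+This is not aider's actual edit format, just the flavor of the problem:
+
+```
+from pathlib import Path
+
+def apply_edit(path, search_text, replace_text):
+    """
+    Apply one search/replace style edit to a source file.
+
+    Returns True if the edit was applied, False if the search text was not
+    found exactly once (the edit is rejected rather than guessed at).
+    """
+    original = Path(path).read_text()
+    if original.count(search_text) != 1:
+        return False
+    Path(path).write_text(original.replace(search_text, replace_text, 1))
+    return True
+```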
+
+## Linting and fixing
+
+[Aider lints code]()
+after every LLM edit, and offers to automatically fix
+any linting errors.
+Aider includes basic linters built with tree-sitter that support
+[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
+These built-in linters will detect syntax errors and other fatal problems with the code.
+
+Users can also configure aider to use their preferred linters.
+This allows aider to check for a larger class of problems, keep the code style
+aligned with the rest of the repo, etc.
+But for the benchmark, aider simply used its built-in linters.
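+
+As a sketch of how a tree-sitter based syntax check can work, the example below
+parses a file and reports the location of any error or missing nodes.
+It assumes the `tree_sitter_languages` package is installed; aider's built-in
+linters do more than this, but this is the core idea:
+
+```
+from tree_sitter_languages import get_parser  # assumption: tree_sitter_languages installed
+
+def syntax_errors(source_code, language="python"):
+    """Parse the code with tree-sitter and return 1-based (line, col) of error nodes."""
+    parser = get_parser(language)
+    tree = parser.parse(source_code.encode("utf-8"))
+
+    errors = []
+
+    def walk(node):
+        if node.type == "ERROR" or node.is_missing:
+            row, col = node.start_point
+            errors.append((row + 1, col + 1))
+        for child in node.children:
+            walk(child)
+
+    walk(tree.root_node)
+    return errors
+
+print(syntax_errors("def broken(:\n    pass\n"))
+```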
+
+Aider shows linting errors to the LLM in a novel format,
+using the abstract syntax tree (AST) to display relevant code context for each
+error.
+This context increases the ability of the LLM to understand the problem and
+make the correct changes to resolve it.
+
+
+
+> app.py:23:36: F821 undefined name 'num'
+> app.py:41:16: F541 f-string is missing placeholders
+>
+> app.py:
+> ...⋮...
+> 6│class LongNum:
+> 7│ def __init__(self, num):
+> 8│ """
+> 9│ Initialize the number.
+> 10│ """
+> ...⋮...
+> 19│ def __str__(self):
+> 20│ """
+> 21│ Render the number as a string.
+> 22│ """
+> 23█ return str(num)
+> 24│
+> 25│
+> 26│@app.route('/subtract//')
+> ...⋮...
+> 38│@app.route('/divide//')
+> 39│def divide(x, y):
+> 40│ if y == 0:
+> 41█ return f"Error: Cannot divide by zero"
+> 42│ else:
+> 43│ result = x / y
+> 44│ return str(result)
+> 45│
+> ...⋮...
+>
+> Attempt to fix lint errors? yes
+
+
+
+
+## Testing and fixing
+
+Aider can be configured with the command needed to run tests for a repo.
+A user working on a python project might do that by launching
+aider like this:
+
+```
+aider --test-cmd pytest
+```
+
+The repositories that are used in the SWE Bench problems are large open
+source projects with extensive existing test suites.
+A repo's test suite can be run in three ways:
+
+1. Run tests as they existed before trying to solve the problem, without any changes.
+2. Run tests after aider has modified the repo.
+So the pre-existing test cases are still present, but may have been modified by aider.
+Aider may have also added new tests.
+3. Run the final "acceptance tests" to judge if the coding agent has
+successfully resolved the problem.
+SWE Bench verifies both pre-existing tests and a set of held out acceptance tests
+(from the so-called `test_patch`)
+to check that the issue is properly resolved. During this final acceptance testing,
+any aider edits to tests are discarded to ensure a faithful test of whether the
+issue was resolved.
+
+For the benchmark, aider is configured with a test command that will run the tests
+as described in (2) above.
+So testing will fail if aider has broken any pre-existing tests or if any new
+tests that it created aren't passing.
+When aider runs a test command, it checks for a non-zero exit status.
+If the tests fail,
+aider will automatically
+share the test output with the LLM and ask it to
+try to resolve the test failures.
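+
+A minimal sketch of that test-running step, assuming the test command is just a
+shell command whose exit status and output are all that matter:
+
+```
+import subprocess
+
+def run_tests(test_cmd="pytest"):
+    """Run the repo's test command; return (passed, combined output)."""
+    result = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
+    return result.returncode == 0, result.stdout + result.stderr
+
+passed, output = run_tests("pytest")
+if not passed:
+    # In aider, this output would be shared with the LLM,
+    # asking it to fix the failing tests.
+    print(output)
+```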
+
+To be clear, *aider can not run or even see the "acceptance tests"* from the `test_patch`
+as described in (3).
+Those tests are only run outside of aider and the benchmark harness,
+to compute the final benchmark score.
+
+
+
+## Finding a plausible solution
+
+As aider executes, it notes the outcome of the editing, linting and testing
+steps.
+When aider completes, it reports its final status as either:
+succeeded with no errors remaining,
+or ended without resolving all errors.
+
+The benchmark harness uses these outcomes to determine if it has a plausible
+solution to the current SWE Bench task.
+A plausible solution is one where aider
+returns saying that it
+edited the repo with no outstanding
+edit, lint or test errors.
+In this case, aider's changes are taken as the proposed solution and recorded
+as the SWE Bench `model_patch` to be evaluated later with the
+`test_patch` "acceptance tests".
+
+If the solution is not plausible, another
+instance of aider is launched again from scratch on the same problem.
+The harness alternates asking GPT-4o and Opus to solve the problem,
+and gives each model three attempts -- for a total of six attempts.
+As soon as a plausible solution is found, it is accepted and the
+harness moves on to the next SWE Bench instance.
+
+It's worth noting that repositories may have lint or test errors present before aider even starts to edit them. Whether errors are caused by aider or were pre-existing, there will be instances where, after six tries, no plausible solution is obtained.
+
+If all six attempts fail to produce a plausible solution,
+then the "best" solution available is selected as a the
+`model_patch`.
+Which of the non-plausible solutions to use is determined
+by ignoring the testing outcome
+and prioritizing solutions in the following order:
+
+ - Pick a solution where editing and linting were completed successfully.
+ - Pick a solution where editing was at least partially successful and linting succeeded.
+ - Pick a solution where editing was successful.
+ - Pick a solution where editing was at least partially successful.
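+
+Putting the retry loop and this fallback selection together, the harness logic
+can be sketched roughly as follows.
+The `run_aider` helper and the result fields are hypothetical placeholders for
+"launch aider from scratch on this problem with this model" and its recorded
+outcomes, not real aider APIs:
+
+```
+MODELS = ["gpt-4o", "opus"]
+
+def solve_instance(problem):
+    """Alternate GPT-4o and Opus, up to 3 attempts each; keep the first plausible diff."""
+    attempts = []
+    for attempt in range(6):
+        model = MODELS[attempt % 2]          # GPT-4o on the first attempt, then alternate
+        result = run_aider(problem, model)   # hypothetical helper, not a real aider API
+        attempts.append(result)
+
+        plausible = (
+            result.edit_outcome == "success"
+            and result.lint_ok
+            and result.tests_ok
+        )
+        if plausible:
+            return result.diff               # recorded as the SWE Bench model_patch
+
+    # No plausible solution: ignore testing and pick the "best" attempt.
+    def priority(r):
+        clean_edit = r.edit_outcome == "success"
+        some_edit = r.edit_outcome in ("success", "partial")
+        return (clean_edit and r.lint_ok, some_edit and r.lint_ok, clean_edit, some_edit)
+
+    return max(attempts, key=priority).diff
+```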
+
+## Computing the benchmark score
+
+The benchmark harness produces one "best" solution for each of the 300
+SWE Bench Lite instances, and saves it as a `model_patch`.
+A separate evaluation script uses the SWE Bench support code to
+test each of these results with the acceptance tests.
+
+These `test_patch` acceptance tests are only ever run outside of aider
+and the benchmark harness, and only to compute the number of
+correctly resolved instances.
+They are never run, used or even visible during the attempts to solve the problems.
+
+Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
+
+## Acknowledgments
+
+Much thanks to the team behind the
+[SWE Bench](https://www.swebench.com)
+family of AI coding benchmarks.
+Also thanks to Albert Örwall, who has
+[dockerized the SWE Bench evaluation scripts](SWE-bench-docker),
+making it faster, easier and more reliable to run the acceptance tests.
+
+