move into website/

This commit is contained in:
Paul Gauthier 2024-06-05 14:28:39 -07:00
parent 5a4d38418d
commit 56519361e2
103 changed files with 5 additions and 12 deletions


@ -0,0 +1 @@
../docs/ctags.md


@ -0,0 +1 @@
../docs/benchmarks.md


@ -0,0 +1 @@
../docs/repomap.md


@ -0,0 +1 @@
../docs/benchmarks-1106.md


@ -0,0 +1 @@
../docs/benchmarks-speed-1106.md


@ -0,0 +1 @@
../docs/unified-diffs.md


@ -0,0 +1 @@
../docs/benchmarks-0125.md


@ -0,0 +1,89 @@
---
title: Claude 3 beats GPT-4 on Aider's code editing benchmark
excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
highlight_image: /assets/2024-03-07-claude-3.jpg
nav_exclude: true
---
# Claude 3 beats GPT-4 on Aider's code editing benchmark
[![benchmark results](/assets/2024-03-07-claude-3.svg)](https://aider.chat/assets/2024-03-07-claude-3.svg)
[Anthropic just released their new Claude 3 models](https://www.anthropic.com/news/claude-3-family)
with evals showing better performance on coding tasks.
With that in mind, I've been benchmarking the new models
using Aider's code editing benchmark suite.
Claude 3 Opus outperforms all of OpenAI's models,
making it the best available model for pair programming with AI.
To use Claude 3 Opus with aider:
```
pip install aider-chat
export ANTHROPIC_API_KEY=sk-...
aider --opus
```
## Aider's code editing benchmark
[Aider](https://github.com/paul-gauthier/aider)
is an open source command line chat tool that lets you
pair program with AI on code in your local git repo.
Aider relies on a
[code editing benchmark](https://aider.chat/docs/benchmarks.html)
to quantitatively evaluate how well
an LLM can make changes to existing code.
The benchmark uses aider to try and complete
[133 Exercism Python coding exercises](https://github.com/exercism/python).
For each exercise,
Exercism provides a starting python file with stubs for the needed functions,
a natural language description of the problem to solve
and a test suite to evaluate whether the coder has correctly solved the problem.
The LLM gets two tries to solve each problem:
1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
2. If any tests fail, aider sends the LLM the failing test output and gives it a second try to complete the task.
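To make the two-try flow concrete, here is a minimal sketch of a single benchmark run. The `ask_llm_to_edit` callable is a hypothetical stand-in for invoking aider on the exercise; this is not the actual benchmark code.
```python
import subprocess


def run_unit_tests(exercise_dir: str) -> tuple[bool, str]:
    """Run the exercise's test suite with pytest; return (passed, combined output)."""
    result = subprocess.run(
        ["pytest", exercise_dir], capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr


def benchmark_exercise(exercise_dir: str, instructions: str, ask_llm_to_edit) -> str:
    """Give the LLM two tries at an exercise, as described above."""
    # First try: the LLM sees only the stub code and the English instructions.
    ask_llm_to_edit(exercise_dir, instructions)
    passed, test_output = run_unit_tests(exercise_dir)
    if passed:
        return "passed on first try"

    # Second try: the LLM also sees the failing test output.
    ask_llm_to_edit(exercise_dir, f"{instructions}\n\nThese tests failed:\n{test_output}")
    passed, _ = run_unit_tests(exercise_dir)
    return "passed on second try" if passed else "failed"
```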
## Benchmark results
### Claude 3 Opus
- The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
- While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
### Claude 3 Sonnet
- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
## Code editing
It's highly desirable to have the LLM send back code edits as
some form of diffs, rather than having it send back an updated copy of the
entire source code.
Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back
updated copies of entire source files.
Aider uses more efficient
[search/replace blocks](https://aider.chat/2023/07/02/benchmarks.html#diff)
with the original GPT-4
and
[unified diffs](https://aider.chat/2023/12/21/unified-diffs.html#unified-diff-editing-format)
with the newer GPT-4 Turbo models.
Claude 3 Opus works best with the search/replace blocks, allowing it to send back
code changes efficiently.
Unfortunately, the Sonnet model was only able to work reliably with whole files,
which limits it to editing smaller source files and uses more tokens, money and time.
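For reference, an aider search/replace edit block looks something like the example below. The file name and the edit shown here are made up purely to illustrate the format:

app.py
```python
<<<<<<< SEARCH
def greeting(name):
    return "Hello"
=======
def greeting(name):
    return f"Hello, {name}!"
>>>>>>> REPLACE
```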
## Other observations
There are a few other things worth noting:
- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
- Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider automatically recovers from these errors with exponential backoff retries, but it's a sign that Anthropic may be struggling under surging demand.
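As a rough illustration of that recovery strategy, the retry logic amounts to an exponential backoff loop like the sketch below. This is a generic example, not aider's actual retry code:
```python
import time


class TransientServerError(Exception):
    """Stand-in for the HTTP 5xx errors returned by an overloaded API."""


def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry `make_request` (a callable) on transient server errors,
    doubling the delay after each failed attempt."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except TransientServerError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```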


@ -0,0 +1,70 @@
---
title: GPT-4 Turbo with Vision is a step backwards for coding
excerpt: OpenAI's GPT-4 Turbo with Vision model scores worse on aider's code editing benchmarks than all the previous GPT-4 models. In particular, it seems much more prone to "lazy coding" than the existing GPT-4 Turbo "preview" models.
highlight_image: /assets/2024-04-09-gpt-4-turbo-laziness.jpg
nav_exclude: true
---
# GPT-4 Turbo with Vision is a step backwards for coding
[OpenAI just released GPT-4 Turbo with Vision](https://twitter.com/OpenAIDevs/status/1777769463258988634)
and it performs worse on aider's coding benchmark suites than all the previous GPT-4 models.
In particular, it seems much more prone to "lazy coding" than the
existing GPT-4 Turbo "preview" models.
## Code editing skill
[![benchmark results](/assets/2024-04-09-gpt-4-turbo.svg)](https://aider.chat/assets/2024-04-09-gpt-4-turbo.svg)
Aider relies on a
[code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark)
to quantitatively evaluate how well
an LLM can make changes to existing code.
The benchmark uses aider to try and complete
[133 Exercism Python coding exercises](https://github.com/exercism/python).
For each exercise, the LLM gets two tries to solve the problem:
1. On the first try, it gets initial stub code and the English description of the coding task. If the tests all pass, we are done.
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
**GPT-4 Turbo with Vision
scores only 62% on this benchmark,
the lowest score of any of the existing GPT-4 models.**
The other models scored 63-66%, so this represents only a small
regression, and is likely statistically insignificant when compared
against `gpt-4-0613`.
## Lazy coding
[![benchmark results](/assets/2024-04-09-gpt-4-turbo-laziness.svg)](https://aider.chat/assets/2024-04-09-gpt-4-turbo-laziness.svg)
The GPT-4 Turbo "preview" models have been widely criticized for being "lazy"
when coding.
They often omit needed code
and instead leave comments with homework assignments like "implement method here".
```
def some_complex_method(foo, bar):
    # ... implement method here ...
```
Aider uses a ["laziness" benchmark suite](https://github.com/paul-gauthier/refactor-benchmark)
which is designed to both provoke and quantify lazy coding.
It consists of
89 python refactoring tasks
which tend to make GPT-4 Turbo code in that lazy manner.
**The new GPT-4 Turbo with Vision model scores only 34% on aider's
refactoring benchmark, making it the laziest coder of all the GPT-4 Turbo models
by a significant margin.**
## Conclusions
Aider has full support for the new GPT-4 Turbo with Vision
model, which you can access using the switch `--model gpt-4-turbo-2024-04-09`.
But aider will continue to use `gpt-4-1106-preview` by default,
as it is by far the strongest coder of the GPT-4 models.


@ -0,0 +1,52 @@
---
title: Aider in your browser
excerpt: Aider has an experimental browser UI, allowing you to collaborate with LLMs on code in your local git repo.
highlight_image: /assets/browser.jpg
nav_order: 800
---
# Aider in your browser
<div class="video-container">
<video controls loop poster="/assets/browser.jpg">
<source src="/assets/aider-browser-social.mp4" type="video/mp4">
<a href="/assets/aider-browser-social.mp4">Aider browser UI demo video</a>
</video>
</div>
<style>
.video-container {
position: relative;
padding-bottom: 101.89%; /* 1080 / 1060 = 1.0189 */
height: 0;
overflow: hidden;
}
.video-container video {
position: absolute;
top: 0;
left: 0;
width: 100%;
height: 100%;
}
</style>
Use aider's new experimental browser UI to collaborate with LLMs
to edit code in your local git repo.
Aider will directly edit the code in your local source files,
and [git commit the changes](https://aider.chat/docs/faq.html#how-does-aider-use-git)
with sensible commit messages.
You can start a new project or work with an existing git repo.
Aider works well with GPT-3.5, GPT-4, GPT-4 Turbo with Vision,
and Claude 3 Opus.
It also supports [connecting to almost any LLM](https://aider.chat/docs/llms.html).
Use the `--browser` switch to launch the browser version of aider:
```
pip install aider-chat
export OPENAI_API_KEY=<key> # Mac/Linux
setx OPENAI_API_KEY <key> # Windows
aider --browser
```


@ -0,0 +1,324 @@
---
title: Drawing graphs with aider, GPT-4o and matplotlib
excerpt: Use GPT-4o to draw graphs with matplotlib, including adjusting styles and making visual changes. You get the graph, but you also get the code in your repo.
highlight_image: /assets/models-over-time.png
nav_exclude: true
---
![LLM coding skill over time](/assets/models-over-time.svg)
# Drawing graphs with aider, GPT-4o and matplotlib
I recently wanted to draw a graph showing how LLM code editing skill has been
changing over time as new models have been released by OpenAI, Anthropic and others.
I have all the
[data in a yaml file](https://github.com/paul-gauthier/aider/blob/main/_data/edit_leaderboard.yml) that is used to render
[aider's LLM leaderboards](https://aider.chat/docs/leaderboards/).
Below is the aider chat transcript, which shows:
- I launch aider with the yaml file, a file with other plots I've done recently (so GPT can crib the style) and an empty file called `over_time.py`.
- Then I ask GPT to draw the scatterplot I want.
- I run the resulting script and share the error output with GPT so it can fix a small bug.
- I ask it to color the points for GPT-4 and GPT-3.5 family models differently, to better see trends within those model families.
- I add a PNG of the graph to the chat, so GPT can see the current graph and understand why I want it to move the point labels so they don't overlap the points.
- I work through a series of other small style changes, like changing fonts and the graph border.
In the end I have the graph, but I also have the python code in my repo.
So I can update this graph easily whenever I add new entries to the yaml data file.
## Aider chat transcript
<div class="chat-transcript" markdown="1">
> Aider v0.35.1-dev
> Models: gpt-4o with diff edit format, weak model gpt-3.5-turbo
> Added _data/edit_leaderboard.yml to the chat.
> Added benchmark/over_time.py to the chat.
> Added benchmark/plots.py to the chat.
#### in over_time.py, draw a graph for each of the yaml entries with a `released` date. plot date on x, `pass_rate_2` on y and label with the `model` name. do a scatter plot. use colors/styles similar to the other graphs.
To draw a scatter plot for each of the YAML entries with a `released` date, we need to:
1. Read the YAML file.
2. Extract the `released` date and `pass_rate_2` for each entry.
3. Plot the data using matplotlib.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
=======
import matplotlib.pyplot as plt
import yaml
from datetime import datetime
def plot_over_time(yaml_file):
    with open(yaml_file, 'r') as file:
        data = yaml.safe_load(file)

    dates = []
    pass_rates = []
    models = []

    for entry in data:
        if 'released' in entry and 'pass_rate_2' in entry:
            dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
            pass_rates.append(entry['pass_rate_2'])
            models.append(entry['model'])

    plt.figure(figsize=(10, 6))
    plt.scatter(dates, pass_rates, c='blue', alpha=0.5)

    for i, model in enumerate(models):
        plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
    plt.xlabel('Release Date')
    plt.ylabel('Pass Rate 2')
    plt.title('Model Performance Over Time')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Example usage
plot_over_time('_data/edit_leaderboard.yml')
>>>>>>> REPLACE
```
> Applied edit to benchmark/over_time.py
> Commit 49ac9ca aider: Added scatter plot functionality for YAML entries with a `released` date and `pass_rate_2`.
</div>
<div class="chat-transcript" markdown="1">
#### /run python benchmark/over_time.py
> Traceback (most recent call last):
File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
plot_over_time('_data/edit_leaderboard.yml')
File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: strptime() argument 1 must be str, not datetime.date
> Add the output to the chat? y
> Traceback (most recent call last):
> File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
> plot_over_time('_data/edit_leaderboard.yml')
> File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
> dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> TypeError: strptime() argument 1 must be str, not datetime.date
The error indicates that the `released` field in the YAML file is already a `datetime.date` object, so we don't need to parse it with `datetime.strptime`. We can directly use it in the plot.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
=======
dates.append(entry['released'])
>>>>>>> REPLACE
```
> Applied edit to benchmark/over_time.py
> Commit af2f4b7 aider: Use the `released` field directly in the plot instead of parsing it with `datetime.strptime`.
</div>
<div class="chat-transcript" markdown="1">
#### color the "gpt-4" models differently
To color the "gpt-4" models differently, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-4" in the model name and assign a different color.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
=======
    colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
    plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
>>>>>>> REPLACE
```
This change will color the points red if the model name contains "gpt-4", otherwise, it will color them blue.
> Applied edit to benchmark/over_time.py
> Commit 0d11357 aider: Color the "gpt-4" models differently.
</div>
<div class="chat-transcript" markdown="1">
#### color the gpt-3.5 models a new color
To color the "gpt-3.5" models a new color, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-3.5" in the model name and assign a different color.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
    colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
=======
    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
>>>>>>> REPLACE
```
> Applied edit to benchmark/over_time.py
> Commit 85d9d84 aider: Color the "gpt-3.5" models green in the plotting code.
</div>
<div class="chat-transcript" markdown="1">
#### use the colorscheme and fonts from the other graphs in plots.py
To use the colorscheme and fonts from the graphs in `plots.py`, we need to:
1. Import the necessary settings from `plots.py`.
2. Apply these settings to the graph in `over_time.py`.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
import yaml
from datetime import datetime
=======
import yaml
from datetime import datetime
from matplotlib import rc
>>>>>>> REPLACE
```
benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.figure(figsize=(10, 6))
=======
    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.grid(axis="y", zorder=0, lw=0.2)
>>>>>>> REPLACE
```
benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
=======
    ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
>>>>>>> REPLACE
```
benchmark/over_time.py
```python
<<<<<<< SEARCH
        plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
    plt.xlabel('Release Date')
    plt.ylabel('Pass Rate 2')
    plt.title('Model Performance Over Time')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
=======
        ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
    ax.set_xlabel('Release Date')
    ax.set_ylabel('Pass Rate 2')
    ax.set_title('Model Performance Over Time')
    plt.tight_layout()
    plt.savefig("tmp_over_time.svg")
    imgcat(fig)
>>>>>>> REPLACE
```
> Applied edit to benchmark/over_time.py
> Commit 8d670f1 aider: Used the colorscheme and fonts from the graphs in `plots.py` in `over_time.py`.
</div>
<div class="chat-transcript" markdown="1">
#### /add tmp_over_time.png
> Added /Users/gauthier/Projects/aider/tmp_over_time.png to the chat
#### make the labels not overlap the dots
To make the labels not overlap the dots, we can adjust the position of the labels slightly using the `xytext` parameter in the `annotate` function.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
        ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
=======
        ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75,
                    xytext=(5, 5), textcoords='offset points')
>>>>>>> REPLACE
```
> Committing benchmark/over_time.py before applying edits.
> Commit 49eb72c Updated image file format from SVG to PNG and added imgcat library for displaying the plot.
> Applied edit to benchmark/over_time.py
> Commit 75fa061 aider: Adjust the position of labels to prevent overlap with dots.
</div>
<div class="chat-transcript" markdown="1">
#### make the dots bigger
To make the dots bigger, we can adjust the `s` parameter in the `scatter` function, which controls the size of the dots.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
    ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
=======
    ax.scatter(dates, pass_rates, c=colors, alpha=0.5, s=100)
>>>>>>> REPLACE
```
> Applied edit to benchmark/over_time.py
> Commit a018fdf aider: Make the dots bigger.
</div>
<div class="chat-transcript" markdown="1">
#### make the border of the graph thinner and lighter
To make the border of the graph thinner and lighter, we can adjust the `spines` properties of the axes.
Here are the changes:
benchmark/over_time.py
```python
<<<<<<< SEARCH
    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
=======
    for spine in ax.spines.values():
        spine.set_edgecolor('#DDDDDD')
        spine.set_linewidth(0.5)
    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
>>>>>>> REPLACE
```
> Applied edit to benchmark/over_time.py
> Commit 77ba518 aider: Made the border of the graph thinner and lighter by adjusting the `spines` properties of the axes.
</div>


@ -0,0 +1,11 @@
---
title: A draft post.
excerpt: With a draft summary.
highlight_image: /assets/linting.jpg
draft: true
nav_exclude: true
---
# A draft post
Content TBD.


@ -0,0 +1,146 @@
---
title: Linting code for LLMs with tree-sitter
excerpt: Aider now lints code after every LLM edit and automatically fixes errors, using tree-sitter and AST-aware code context.
highlight_image: /assets/linting.jpg
nav_exclude: true
---
[![Linting code](/assets/linting.jpg)](https://aider.chat/assets/linting.jpg)
# Linting code for LLMs with tree-sitter
Aider now lints your code after every LLM edit, and offers to automatically fix
any linting errors.
You can also use aider's lint-and-fix functionality on your source files any time
you like, to speedily resolve issues with code written by humans.
Aider shows linting errors to the LLM in a novel format,
using tree-sitter
to help display relevant code context for each
error.
This increases the ability of the LLM to understand the problem and
make the correct changes to resolve it.
Aider ships with basic linters built with tree-sitter that support
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
These built-in linters will detect syntax errors and other fatal problems with the code.
You can also configure aider to use your preferred linters.
This allows aider to check for a larger class of problems, keep the code style
aligned with the rest of your team, etc.
## Linting and fixing your code
Aider now lints each source file after it applies the edits
suggested by an LLM.
If problems are found, aider will ask if you'd like it to
attempt to fix the errors.
If so, aider will send the LLM a report of the lint errors
and request changes to fix them. This process may iterate a few times
as the LLM works to fully resolve all the issues.
You can also lint and fix files any time, on demand from within the aider chat or via the
command line:
- By default, the in-chat `/lint` command will lint and fix all the files which have
been added to the chat. Or you can name any files
in your git repo as arguments.
- From the command line, you can run `aider --lint` to lint and fix
all the dirty files in the repo.
Or you can specify specific filenames on the command line.
## An LLM-friendly lint report
Most linting tools produce terse and cryptic output,
which is one reason many engineers appreciate IDEs that highlight
linting errors.
LLMs don't have the luxury of using an IDE, so aider sends
the linting errors in an LLM-friendly format.
Here's an example of raw output of the `flake8` python linter:
```
app.py:23:36: F821 undefined name 'num'
app.py:41:16: F541 f-string is missing placeholders
```
This sort of output depends on the user to reference line numbers to find and fix
each reported error.
LLMs are quite bad at working with source code line numbers, often
making off-by-one errors and other mistakes even when provided with
a fully numbered code listing.
Aider augments the raw linter output by
displaying and
highlighting the lines that have errors within their
containing functions, methods and classes.
To do this, aider uses tree-sitter to obtain the code's AST and analyzes it
in light of the linting errors.
LLMs are more effective at editing code that's provided
with context like this.
```
app.py:23:36: F821 undefined name 'num'
app.py:41:16: F541 f-string is missing placeholders
app.py:
...⋮...
6│class LongNum:
7│    def __init__(self, num):
8│        """
9│        Initialize the number.
10│        """
...⋮...
19│    def __str__(self):
20│        """
21│        Render the number as a string.
22│        """
23█        return str(num)
24│
25│
26│@app.route('/subtract/<int:x>/<int:y>')
...⋮...
38│@app.route('/divide/<int:x>/<int:y>')
39│def divide(x, y):
40│    if y == 0:
41█        return f"Error: Cannot divide by zero"
42│    else:
43│        result = x / y
44│        return str(result)
45│
...⋮...
```
## Basic linters for most popular languages
Aider comes batteries-included with built-in linters for
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
This provides wide support for linting without requiring
users to manually install a linter and configure it to work with aider.
Aider's built in language-agnostic linter uses tree-sitter to parse
the AST of each file.
When tree-sitter encounters a syntax error or other fatal issue
parsing a source file, it inserts an AST node with type `ERROR`.
Aider simply uses these `ERROR` nodes to identify all the lines
with syntax or other types of fatal error, and displays
them in the LLM-friendly format described above.
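Here is a minimal sketch of that idea, assuming the `tree_sitter_languages` package for ready-made grammars. Aider's real linter is more involved, but the core check is just a walk of the parse tree looking for `ERROR` or missing nodes:
```python
from tree_sitter_languages import get_parser  # bundled tree-sitter grammars


def find_error_lines(source_code: str, language: str = "python") -> list[int]:
    """Return the 1-based line numbers where tree-sitter reports parse errors."""
    parser = get_parser(language)
    tree = parser.parse(source_code.encode("utf-8"))

    error_lines = []

    def walk(node):
        if node.type == "ERROR" or node.is_missing:
            # start_point is a 0-based (row, column) tuple
            error_lines.append(node.start_point[0] + 1)
        for child in node.children:
            walk(child)

    walk(tree.root_node)
    return sorted(set(error_lines))


print(find_error_lines("def broken(:\n    pass\n"))
```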
## Configuring your preferred linters
You can optionally configure aider to use
your preferred linters with the `--lint-cmd` switch.
```
# To lint javascript with jslint
aider --lint-cmd javascript:jslint
# To lint python with flake8 using some specific args:
aider --lint-cmd "python:flake8 --select=E9,F821,F823..."
```
You can provide multiple `--lint-cmd` switches
to set linters for various languages.
You can also durably set linters in your `.aider.conf.yml` file.


@ -0,0 +1,451 @@
---
title: How aider scored SOTA 26.3% on SWE Bench Lite
excerpt: Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
highlight_image: /assets/swe_bench_lite.jpg
nav_exclude: true
---
# How aider scored SOTA 26.3% on SWE Bench Lite
[Aider scored 26.3%](https://github.com/swe-bench/experiments/pull/7)
on the
[SWE Bench Lite benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The previous top leaderboard entry was 20.3%
from Amazon Q Developer Agent.
See also [aider's SOTA result on the main SWE Bench](https://aider.chat/2024/06/02/main-swe-bench.html).
[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
**All of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.**
All results in the above chart are unhinted pass@1 results.
Please see the [references](#references)
for details on the data presented in this chart.
It was corrected on 5/30/24 to reflect apples-to-apples comparisons,
using pass@1 results from AutoCodeRover
and results from OpenDevin that don't use hints.
The [official SWE Bench Lite leaderboard](https://www.swebench.com)
only accepts pass@1 results that do not use hints.
## Interactive, not agentic
Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.
It's also worth noting that aider currently does not use
RAG, vector search, tools or give the LLM access to search the web
or unilaterally execute code.
Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming UX where users can ask for a change
and see the edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasting time and token costs.
## Benchmark methodology
For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that aider runs as normal, with the following modifications:
- Aider's suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution isn't plausible, the harness launches aider to try again from scratch,
alternating between using aider with GPT-4o and Opus.
- If no plausible solution is found after six tries, the harness picks the solution
with the fewest edit/lint/test problems.
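Putting those rules together, the retry loop amounts to something like the sketch below, where `run_aider`, `is_plausible` and `count_problems` are hypothetical stand-ins for the real harness helpers:
```python
MODELS = ["gpt-4o", "claude-3-opus-20240229"]


def solve_instance(problem, run_aider, is_plausible, count_problems):
    """Try up to six times, alternating models, and keep the first plausible
    solution; otherwise fall back to the least-problematic attempt."""
    attempts = []
    for attempt in range(6):
        model = MODELS[attempt % 2]  # GPT-4o first, then alternate with Opus
        solution = run_aider(problem, model)  # each attempt starts from scratch
        if is_plausible(solution):  # clean edits, lint and pre-existing tests
            return solution
        attempts.append(solution)

    # No plausible solution after six tries: pick the attempt with the
    # fewest edit/lint/test problems.
    return min(attempts, key=count_problems)
```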
It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.
The [full harness to run aider on SWE Bench Lite is available on GitHub](https://github.com/paul-gauthier/aider-swe-bench).
The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:
- They could launch aider in their repo with the command below, which
tells aider they want to accept every suggestion
and to use pytest to run tests.
- `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try and solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to revert AI changes that don't pan out.
Outside a benchmark setting, it's probably
unwise or at least highly inefficient
to let *any* AI agent run unsupervised on your code base.
The reason aider is intended to be used interactively
is so that the user can participate and direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.
## Aider with GPT-4o alone was SOTA
Running the benchmark harness
only using aider with GPT-4o to find plausible solutions
achieved a score of 25.0%.
This was itself matching the state-of-the-art, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.
As noted below, a single attempt using Aider with GPT-4o tied
the current top entry on the leaderboard.
## Aider with GPT-4o & Opus
The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found for each
problem.
The table below breaks down the plausible solutions that
were found for the 300 problems.
It also provides details on the 79 that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:
- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark.
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found using both models, including one correctly resolved solution on a problem's final, sixth attempt.
| Attempt | Agent |Number&nbsp;of<br>plausible<br>solutions|Percent&nbsp;of<br>plausible<br>solutions| Number&nbsp;of<br/>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench<br>Lite |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
If we break down the solutions solely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw problems that GPT-4o failed to
find plausible solutions for on its first try.
Aider with GPT-4o was producing higher quality plausible solutions,
with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.
| Agent | Number&nbsp;of<br>plausible<br>solutions | Number&nbsp;of<br>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>plausible<br>which<br>correctly<br>resolved<br>|
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 239 | 66 |27.6% |
| Aider with Opus | 61 | 13 |21.3% |
| **Total** | **300** | **79** |**26.3%** |
## Repository map, not RAG
The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and providing the LLM with
tools to interactively explore the code base.
Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created through static analysis of the code's
abstract syntax tree and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation.
This is done by performing a graph optimization on the code's call graph.
When the user asks for a change to their code, the LLM can use the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
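The file-mention step can be as simple as scanning the LLM's reply against the repo's file list. Here is a toy sketch of that idea, not aider's actual implementation:
```python
from pathlib import PurePosixPath


def mentioned_files(llm_reply: str, repo_files: list[str]) -> list[str]:
    """Return repo files whose path or basename appears as a word in the reply."""
    words = {word.strip("`'\".,:;!?()") for word in llm_reply.split()}
    return [
        path
        for path in repo_files
        if path in words or PurePosixPath(path).name in words
    ]


# The reply mentions app.py, so aider would offer to add it to the chat.
print(mentioned_files("The most likely file to edit is app.py.", ["app.py", "util/db.py"]))
```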
<div class="chat-transcript" markdown="1">
#### Please add a new /factorial/N endpoint.
To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
Please add app.py to the chat so I can proceed with the changes.
> app.py
> Add these files to the chat? yes
</div>
This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench problems.
Aider successfully identified the correct file to edit
in 70.3% of the benchmark tasks.
We can determine which file needs to be edited using the "gold" patch
which is associated with each SWE Bench task.
This patch was created by a human developer
to solve the issue, and therefore reveals a file which can
be edited to solve the problem.
Of course aider is not able to see or use the gold patch
or the file names it contains in any way.
This information was only used to compute
statistics outside the benchmarking process.
## Reliable code editing
Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.
Aider goes to great lengths to ensure that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that aider can
properly integrate code from LLMs into an existing code base and source files.
The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
This helps ensure that the project's existing APIs and conventions are
respected and utilized when new code is added.
Regardless, there are still cases where aider may be unable to cleanly
complete the edits specified by the LLM.
This is usually because the LLM has failed to conform to the editing
instructions in its system prompt.
When aider completes, it returns an editing outcome that indicates
whether it was able to successfully apply all edits.
The benchmark harness uses this editing status as
one criterion to determine if aider has
created a plausible solution.
## Linting and fixing
Another key criteria for a plausible solution is that it passes basic
linting, which means that the code has no syntax
or other fatal errors.
[Aider lints code](https://aider.chat/2024/05/22/linting.html)
after every LLM edit and offers to automatically fix
any problems.
Aider ships with built-in linters based on tree-sitter
which work with most popular programming languages.
Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree to display relevant code context for each
error.
This context helps LLMs understand the problem and
make the correct changes to resolve it.
<div class="chat-transcript" markdown="1">
```
app.py:23:36: F821 undefined name 'num'
app.py:
...⋮...
6│class LongNum:
...⋮...
19│    def expound(self, threshold):
20│        number = self.basis
21│        while number < threshold:
22│            number *= self.factor
23█        return num
24│
25│
...⋮...
```
> Attempt to fix lint errors? yes
</div>
In the benchmark, these linting suggestions are always accepted.
At completion,
aider reports a linting outcome that
indicates if it was able to produce
code without any outstanding linting errors.
The benchmark harness uses this status as
one of the criteria to determine if aider has
created a plausible solution.
## Testing and fixing
The final criterion for a plausible solution is that
all tests must be passing.
Aider can be configured with the command to run tests for a repo,
and will automatically attempt to fix any test failures.
A user working on a python project might configure testing
by launching aider like this:
```
aider --test-cmd pytest
```
For the benchmark, aider is configured with a test command that will run the
tests that already exist in each problem's repository.
SWE Bench problems are based on repositories from large open
source projects with extensive existing test suites.
This means that
testing will fail if aider has broken any of these
pre-existing tests or if any new
tests that it created aren't passing.
As with editing and linting, aider reports a testing outcome
that indicates if it completed with any outstanding failing tests.
The benchmark harness uses this status when deciding if aider
has produced a plausible solution.
To be clear, *aider cannot run or even see the held out "acceptance tests"* that
are used to judge if a proposed solution correctly
resolves the problem.
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark statistics.
## Finding a plausible solution
Each time aider executes, it reports
the outcome of the editing, linting, and testing
steps.
Each of these steps may complete successfully or
return a status that indicates that there were outstanding
problems that remain unresolved.
The benchmark harness uses these outcomes to determine if
aider has produced a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint, or test errors.
In this case, aider's changes are recorded
as the SWE Bench `model_patch` to be evaluated later with the
acceptance tests.
If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
The harness alternates launching aider with GPT-4o and Opus to solve the problem,
and gives each model three attempts -- for a total of six attempts.
As soon as a plausible solution is found, it is accepted and the
harness moves on to the next SWE Bench instance.
It's worth noting that repositories may have lint or test errors
present before aider even starts to edit them.
Whether unresolved errors were caused by aider or were pre-existing,
there will be instances where
no plausible solution is
found after six tries.
If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
`model_patch`.
Which of the non-plausible solutions to use is determined
by ignoring the testing outcome
and prioritizing solutions in the following order:
- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.
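That fallback amounts to ranking the failed attempts by how much of the edit/lint pipeline succeeded, ignoring the test outcome. Here is a sketch over hypothetical per-attempt outcome flags:
```python
def pick_best_non_plausible(attempts):
    """Each attempt is a dict of outcome flags, e.g.
    {"edited": True, "partially_edited": True, "lint_ok": False}."""

    def rank(attempt):
        if attempt["edited"] and attempt["lint_ok"]:
            return 0  # editing and linting both completed successfully
        if attempt["partially_edited"] and attempt["lint_ok"]:
            return 1  # editing partially successful, linting succeeded
        if attempt["edited"]:
            return 2  # editing successful
        if attempt["partially_edited"]:
            return 3  # editing at least partially successful
        return 4

    return min(attempts, key=rank)
```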
## Computing the benchmark score
The benchmark harness produced a plausible solution for each of the 300
SWE Bench Lite instances and saved it as the `model_patch`.
A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
are discarded.
This ensures that the correct,
unmodified test suite is used for acceptance testing.
The evaluation script compares the test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they match, the candidate solution has correctly resolved the issue.
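Conceptually, that check requires every test to have the same pass/fail status under the candidate patch as under the gold patch. A toy sketch, not the real SWE Bench evaluation code:
```python
def correctly_resolved(candidate_results: dict, gold_results: dict) -> bool:
    """Both arguments map test names to a "pass" or "fail" status string."""
    return all(
        candidate_results.get(test) == status
        for test, status in gold_results.items()
    )


gold = {"test_divide": "pass", "test_subtract": "pass"}
print(correctly_resolved({"test_divide": "pass", "test_subtract": "pass"}, gold))  # True
print(correctly_resolved({"test_divide": "fail", "test_subtract": "pass"}, gold))  # False
```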
These acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used, or even visible during aider's attempts to solve the problems.
Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
## Acknowledgments
Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker)
making it faster, easier, and more reliable to run the acceptance tests.
## References
All of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
Thus it is a pass@1 result.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
Below are the references for the other pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.
- [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)
- [19.0% AutoCodeRover](https://www.swebench.com/)
- [18.0% SWE-Agent + GPT-4](https://www.swebench.com)
- [16.7% OpenDevin](https://github.com/OpenDevin/OpenDevin/issues/2149)
- [11.7% SWE-Agent + Opus](https://www.swebench.com)
Note, the graph was corrected on 5/30/24 as follows.
The graph now contains AutoCodeRover's average pass@1 results.
Previously it displayed pass@3 results, which are
not comparable
to the pass@1 results for aider being reported here.
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
features pass@3 results
that are not clearly labeled as such.
The graph now contains the best OpenDevin results obtained without using
the SWE Bench `hints_text` to provide hints to the agent.
The previous graph contained their hinted result,
which is not comparable
to the unhinted aider results being reported here.
[OpenDevin reported hinted results](https://x.com/gneubig/status/1791498953709752405)
without noting that hints were used.


@ -0,0 +1,67 @@
---
title: Aider has written 7% of its own code
excerpt: Aider has written 7% of its own code, via 600+ commits that inserted 4.8K and deleted 1.5K lines of code.
highlight_image: /assets/self-assembly.jpg
nav_exclude: true
---
# Aider has written 7% of its own code
[![self assembly](/assets/self-assembly.jpg)](https://aider.chat/assets/self-assembly.jpg)
The
[aider git repo](https://github.com/paul-gauthier/aider)
currently contains about 4K commits and 14K lines of code.
Aider made 15% of the commits, inserting 4.8K and deleting 1.5K lines of code.
About 7% of the code now in the repo is attributable to an aider commit
using `git blame`.
This number is probably a significant undercount, because periodic reformatting
by `black` is likely obscuring aider's authorship of many lines.
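A per-file count like the table below can be computed from `git blame --line-porcelain`, attributing a line to aider when its commit message starts with the `aider:` prefix aider uses for its commits. This is a sketch of the approach, not the exact script in aider's repo:
```python
import subprocess


def aider_line_counts(path: str) -> tuple[int, int]:
    """Return (aider_lines, total_lines) for one file in the current git repo."""
    porcelain = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout

    total = aider = 0
    for line in porcelain.splitlines():
        # --line-porcelain emits one "summary <commit message>" header per blamed line
        if line.startswith("summary "):
            total += 1
            if line[len("summary "):].startswith("aider:"):
                aider += 1
    return aider, total


print(aider_line_counts("aider/commands.py"))
```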
Here's the breakdown of the code aider wrote in the current code base
according to `git blame`.
| File | Lines | Percent |
|---|---:|---:|
|aider/args.py| 6 of 449 | 1.3% |
|aider/coders/base_coder.py| 37 of 1354 | 2.7% |
|aider/coders/editblock_coder.py| 14 of 507 | 2.8% |
|aider/coders/editblock_func_coder.py| 6 of 141 | 4.3% |
|aider/coders/udiff_coder.py| 2 of 421 | 0.5% |
|aider/coders/wholefile_coder.py| 5 of 146 | 3.4% |
|aider/coders/wholefile_func_coder.py| 4 of 134 | 3.0% |
|aider/commands.py| 67 of 703 | 9.5% |
|aider/diffs.py| 15 of 129 | 11.6% |
|aider/gui.py| 2 of 533 | 0.4% |
|aider/history.py| 19 of 124 | 15.3% |
|aider/io.py| 55 of 368 | 14.9% |
|aider/linter.py| 30 of 240 | 12.5% |
|aider/main.py| 30 of 466 | 6.4% |
|aider/mdstream.py| 3 of 122 | 2.5% |
|aider/models.py| 22 of 549 | 4.0% |
|aider/repo.py| 19 of 266 | 7.1% |
|aider/repomap.py| 17 of 518 | 3.3% |
|aider/scrape.py| 12 of 199 | 6.0% |
|aider/versioncheck.py| 10 of 37 | 27.0% |
|aider/voice.py| 9 of 104 | 8.7% |
|benchmark/benchmark.py| 33 of 730 | 4.5% |
|benchmark/over_time.py| 32 of 60 | 53.3% |
|benchmark/swe_bench_lite.py| 40 of 71 | 56.3% |
|scripts/blame.py| 55 of 212 | 25.9% |
|scripts/versionbump.py| 96 of 123 | 78.0% |
|setup.py| 11 of 47 | 23.4% |
|tests/test_coder.py| 48 of 612 | 7.8% |
|tests/test_commands.py| 135 of 588 | 23.0% |
|tests/test_editblock.py| 23 of 403 | 5.7% |
|tests/test_io.py| 30 of 65 | 46.2% |
|tests/test_main.py| 13 of 239 | 5.4% |
|tests/test_models.py| 6 of 28 | 21.4% |
|tests/test_repo.py| 2 of 296 | 0.7% |
|tests/test_repomap.py| 70 of 217 | 32.3% |
|tests/test_udiff.py| 7 of 119 | 5.9% |
|tests/test_wholefile.py| 37 of 321 | 11.5% |
| **Total** | **1022 of 14219** | **7.2%** |


@ -0,0 +1,264 @@
---
title: Aider is SOTA for both SWE Bench and SWE Bench Lite
excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
highlight_image: /assets/swe_bench.jpg
nav_exclude: true
---
# Aider is SOTA for both SWE Bench and SWE Bench Lite
Aider scored 18.9%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
This result on the main SWE Bench builds on
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
[![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
**All of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.**
Aider was benchmarked on the same
[570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
that were used in the
[Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
See the [references](#references)
for more details on the data presented in this chart.
## Interactive, not agentic
Aider achieved this result mainly through its existing features that focus on static
code analysis, reliable LLM code editing, and pragmatic UX for automatically
fixing linting and testing errors.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.
It's also worth noting that aider currently does not use
RAG, vector search, tools or give the LLM access to search the web
or unilaterally execute code.
Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming UX where users can ask for a change
and see code edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This allows them to quickly steer misunderstandings back on course and
avoid wasting time and token costs.
## Benchmark methodology
Benchmarking was conducted as follows:
- Aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
- After that aider ran as normal, except all of aider's
suggestions were always accepted without user approval.
- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems.
It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.
This is the same approach
that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
For the Lite benchmark,
aider alternated between GPT-4o and Opus for up to six total attempts.
To manage the cost of running the main SWE Bench benchmark,
aider was limited to two total attempts:
one with GPT-4o and one with Opus.
For a detailed discussion of the benchmark
methodology, see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
Also, the
[aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
contains the harness and statistics code used for the benchmarks.
The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:
- They could launch aider in their repo with the command below, which
tells aider they want to accept every suggestion
and to use pytest to run tests.
- `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try and resolve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to
[use git to revert the changes](https://aider.chat/docs/faq.html#how-does-aider-use-git),
and try again with `aider --opus`.
## Aider with GPT-4o alone was SOTA
Using aider with GPT-4o to make a single attempt at resolving each problem
achieved a score of 17.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.
## Aider with GPT-4o & Opus
The benchmark harness started by using aider with GPT-4o to try
and resolve each problem.
For problems where this didn't produce a plausible solution,
the harness tried again using aider with Opus.
So at most, two attempts were made for each problem.
The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.
A proposed solution is either:
- A plausible solution where
aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.
| Attempt | Agent |Number&nbsp;of<br>proposed<br>solutions|Percent&nbsp;of<br>proposed<br>solutions| Number&nbsp;of<br/>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
## Non-plausible but correct solutions?
A solution doesn't actually have to be plausible in order to correctly resolve the issue.
Recall that plausible is simply defined as aider
reporting that it successfully completed all file edits,
repaired and resolved any linting errors
and resolved any test failures.
But there are many reasons why aider might fail to do those things
and yet still produce a solution that will pass
acceptance testing:
- There may have been pre-existing failing tests in the repo,
before aider even started working on the SWE Bench problem.
Aider may not have resolved such issues, and yet they may not be
relevant to the acceptance testing.
The SWE Bench acceptance testing just confirms that tests pass or fail
in the same pattern as the "gold patch" developed by a human to resolve the
problem.
Some tests may fail during acceptance testing,
and that's ok as long as they failed for the gold
patch too.
- There may have been pre-existing linting problems in the repo.
If lingering linting issues affected code paths that are not well tested,
they may not impact acceptance testing.
- Aider may have reported file editing errors because it thought the LLM
specified edits that it wasn't able to successfully apply.
This can only happen when the LLM specified edits in
a way that doesn't comply with the editing instructions in the system prompt.
Given that the LLM isn't complying with the system prompt,
it may have become confused and
asked for redundant or otherwise irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.
- Etc.
Keeping all this in mind, we can understand why
GPT-4o accounts for 15.3% of the benchmark score in the table above,
but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
When an Opus attempt is allowed after GPT-4o,
it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.
These more plausible, incorrect solutions can
eclipse some of
the earlier non-plausible correct solutions that GPT-4o generated.
This is why GPT-4o's score in the table
showing the combined GPT-4o & Opus results (15.3%)
is lower than the result from just one try using aider with GPT-4o (17.0%).
For these reasons, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
New solutions may resolve some new problems but they may also
eclipse and discard some of the previous non-plausible correct solutions.
Luckily, the net effect of additional attempts
usually increases or at least maintains the
number of resolved solutions.
This was the case for all the attempts made in both this main SWE Bench result and the
earlier Lite result.
## Computing the benchmark score
The benchmark harness produced one proposed solution for each of
the 570 SWE Bench problems.
A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each proposed solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly resolve the issue.
If they matched, the proposed solution correctly resolved the issue.
These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to resolve the problems.
Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
or 18.9%.
## Acknowledgments
Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker)
making it faster, easier, and more reliable to run the acceptance tests.
## References
All of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.
The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
Thus it is a pass@1 result.
This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions pass, that counts as a pass@N success.
Below are the references for the other pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.
- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
- [13.8% Amazon Q Developer Agent, benchmarked on 2,294 instances.](https://www.swebench.com)
- [12.5% SWE-Agent + GPT-4, benchmarked on 2,294 instances.](https://www.swebench.com)
- [10.6% AutoCodeRover, benchmarked on 2,294 instances.](https://arxiv.org/pdf/2404.05427v2)
- [10.5% SWE-Agent + Opus, benchmarked on 2,294 instances.](https://www.swebench.com)
The graph contains average pass@1 results for AutoCodeRover.
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
features their pass@3 results
that are not clearly labeled as such.
Table 2 of their
[paper](https://arxiv.org/pdf/2404.05427v2)
reports an `ACR-avg` result of 10.59% which is an average pass@1 result.