Commit 56519361e2 (parent 5a4d38418d): move into website/

103 changed files with 5 additions and 12 deletions
website/_posts/2023-05-25-ctags.md (new symbolic link, 1 line) -> ../docs/ctags.md
website/_posts/2023-07-02-benchmarks.md (new symbolic link, 1 line) -> ../docs/benchmarks.md
website/_posts/2023-10-22-repomap.md (new symbolic link, 1 line) -> ../docs/repomap.md
website/_posts/2023-11-06-benchmarks-1106.md (new symbolic link, 1 line) -> ../docs/benchmarks-1106.md
website/_posts/2023-11-06-benchmarks-speed-1106.md (new symbolic link, 1 line) -> ../docs/benchmarks-speed-1106.md
website/_posts/2023-12-21-unified-diffs.md (new symbolic link, 1 line) -> ../docs/unified-diffs.md
website/_posts/2024-01-25-benchmarks-0125.md (new symbolic link, 1 line) -> ../docs/benchmarks-0125.md
website/_posts/2024-03-08-claude-3.md (new file, 89 lines)

---
|
||||
title: Claude 3 beats GPT-4 on Aider's code editing benchmark
|
||||
excerpt: Claude 3 Opus outperforms all of OpenAI's models on Aider's code editing benchmark, making it the best available model for pair programming with AI.
|
||||
highlight_image: /assets/2024-03-07-claude-3.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
# Claude 3 beats GPT-4 on Aider's code editing benchmark
|
||||
|
||||
[Benchmark results chart](https://aider.chat/assets/2024-03-07-claude-3.svg)
|
||||
|
||||
[Anthropic just released their new Claude 3 models](https://www.anthropic.com/news/claude-3-family)
|
||||
with evals showing better performance on coding tasks.
|
||||
With that in mind, I've been benchmarking the new models
|
||||
using Aider's code editing benchmark suite.
|
||||
|
||||
Claude 3 Opus outperforms all of OpenAI's models,
|
||||
making it the best available model for pair programming with AI.
|
||||
|
||||
To use Claude 3 Opus with aider:
|
||||
|
||||
```
|
||||
pip install aider-chat
|
||||
export ANTHROPIC_API_KEY=sk-...
|
||||
aider --opus
|
||||
```
|
||||
|
||||
## Aider's code editing benchmark
|
||||
|
||||
[Aider](https://github.com/paul-gauthier/aider)
|
||||
is an open source command line chat tool that lets you
|
||||
pair program with AI on code in your local git repo.
|
||||
|
||||
Aider relies on a
|
||||
[code editing benchmark](https://aider.chat/docs/benchmarks.html)
|
||||
to quantitatively evaluate how well
|
||||
an LLM can make changes to existing code.
|
||||
The benchmark uses aider to try and complete
|
||||
[133 Exercism Python coding exercises](https://github.com/exercism/python).
|
||||
For each exercise,
|
||||
Exercism provides a starting python file with stubs for the needed functions,
|
||||
a natural language description of the problem to solve
|
||||
and a test suite to evaluate whether the coder has correctly solved the problem.
|
||||
|
||||
The LLM gets two tries to solve each problem:
|
||||
|
||||
1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
|
||||
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task, as sketched below.
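Purely for illustration, here is a minimal sketch of that two-try protocol. The `run_aider` and `run_tests` callables are hypothetical stand-ins, not aider's real API:

```python
# Hedged sketch of the two-try benchmark protocol, with caller-supplied helpers.
def benchmark_exercise(run_aider, run_tests, stub_code, instructions):
    # First try: the LLM sees only the stub code and the task description.
    solution = run_aider(stub_code, instructions, test_output=None)
    failures = run_tests(solution)
    if not failures:
        return "pass on first try"

    # Second try: the failing test output is sent back to the LLM.
    solution = run_aider(solution, instructions, test_output=failures)
    return "pass on second try" if not run_tests(solution) else "fail"
```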
|
||||
|
||||
## Benchmark results
|
||||
|
||||
### Claude 3 Opus
|
||||
|
||||
- The new `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
|
||||
- Its single-try performance was comparable to the latest GPT-4 Turbo model `gpt-4-0125-preview`, at 54.1%.
|
||||
- While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
|
||||
|
||||
### Claude 3 Sonnet
|
||||
|
||||
- The new `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.
|
||||
|
||||
## Code editing
|
||||
|
||||
It's highly desirable to have the LLM send back code edits as
|
||||
some form of diffs, rather than having it send back an updated copy of the
|
||||
entire source code.
|
||||
|
||||
Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back
|
||||
updated copies of entire source files.
|
||||
Aider uses more efficient
|
||||
[search/replace blocks](https://aider.chat/2023/07/02/benchmarks.html#diff)
|
||||
with the original GPT-4
|
||||
and
|
||||
[unified diffs](https://aider.chat/2023/12/21/unified-diffs.html#unified-diff-editing-format)
|
||||
with the newer GPT-4 Turbo models.
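As a rough illustration, a search/replace block names the file and then pairs the code to find with its replacement. This toy example uses a hypothetical `hello.py`; the exact fencing aider uses appears verbatim in the chat transcripts later in this commit:

hello.py
```python
<<<<<<< SEARCH
def greeting():
    print("hello")
=======
def greeting(name):
    print(f"hello {name}")
>>>>>>> REPLACE
```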
|
||||
|
||||
Claude 3 Opus works best with the search/replace blocks, allowing it to send back
|
||||
code changes efficiently.
|
||||
Unfortunately, the Sonnet model was only able to work reliably with whole files,
|
||||
which limits it to editing smaller source files and uses more tokens, money and time.
|
||||
|
||||
## Other observations
|
||||
|
||||
There are a few other things worth noting:
|
||||
|
||||
- Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI's models. You can get almost the same coding skill faster and cheaper with OpenAI's models.
|
||||
- Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
|
||||
- The Claude models refused to perform a number of coding tasks and returned the error "Output blocked by content filtering policy". They refused to code up the [beer song](https://exercism.org/tracks/python/exercises/beer-song) program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
|
||||
- The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider automatically recovers from these errors with exponential backoff retries, but it's a sign that Anthropic may be struggling under surging demand.
|
||||
|
website/_posts/2024-04-09-gpt-4-turbo.md (new file, 70 lines)

---
|
||||
title: GPT-4 Turbo with Vision is a step backwards for coding
|
||||
excerpt: OpenAI's GPT-4 Turbo with Vision model scores worse on aider's code editing benchmarks than all the previous GPT-4 models. In particular, it seems much more prone to "lazy coding" than the existing GPT-4 Turbo "preview" models.
|
||||
highlight_image: /assets/2024-04-09-gpt-4-turbo-laziness.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
# GPT-4 Turbo with Vision is a step backwards for coding
|
||||
|
||||
[OpenAI just released GPT-4 Turbo with Vision](https://twitter.com/OpenAIDevs/status/1777769463258988634)
|
||||
and it performs worse on aider's coding benchmark suites than all the previous GPT-4 models.
|
||||
In particular, it seems much more prone to "lazy coding" than the
|
||||
existing GPT-4 Turbo "preview" models.
|
||||
|
||||
## Code editing skill
|
||||
|
||||
[Code editing benchmark results](https://aider.chat/assets/2024-04-09-gpt-4-turbo.svg)
|
||||
|
||||
Aider relies on a
|
||||
[code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark)
|
||||
to quantitatively evaluate how well
|
||||
an LLM can make changes to existing code.
|
||||
The benchmark uses aider to try and complete
|
||||
[133 Exercism Python coding exercises](https://github.com/exercism/python).
|
||||
|
||||
For each exercise, the LLM gets two tries to solve each problem:
|
||||
|
||||
1. On the first try, it gets initial stub code and the English description of the coding task. If the tests all pass, we are done.
|
||||
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
|
||||
|
||||
**GPT-4 Turbo with Vision
|
||||
scores only 62% on this benchmark,
|
||||
the lowest score of any of the existing GPT-4 models.**
|
||||
The other models scored 63-66%, so this represents only a small
|
||||
regression, and is likely statistically insignificant when compared
|
||||
against `gpt-4-0613`.
|
||||
|
||||
## Lazy coding
|
||||
|
||||
[Refactoring benchmark results](https://aider.chat/assets/2024-04-09-gpt-4-turbo-laziness.svg)
|
||||
|
||||
The GPT-4 Turbo "preview" models have been widely criticized for being "lazy"
|
||||
when coding.
|
||||
They often omit needed code
|
||||
and instead leave comments with homework assignments like "implement method here".
|
||||
|
||||
```
|
||||
def some_complex_method(foo, bar):
|
||||
# ... implement method here ...
|
||||
```
|
||||
|
||||
Aider uses a ["laziness" benchmark suite](https://github.com/paul-gauthier/refactor-benchmark)
|
||||
which is designed to both provoke and quantify lazy coding.
|
||||
It consists of
|
||||
89 python refactoring tasks
|
||||
which tend to make GPT-4 Turbo code in that lazy manner.
|
||||
|
||||
**The new GPT-4 Turbo with Vision model scores only 34% on aider's
|
||||
refactoring benchmark, making it the laziest coder of all the GPT-4 Turbo models
|
||||
by a significant margin.**
|
||||
|
||||
## Conclusions
|
||||
|
||||
Aider has full support for the new GPT-4 Turbo with Vision
|
||||
model, which you can access using the switch `--model gpt-4-turbo-2024-04-09`.
|
||||
But aider will continue to use `gpt-4-1106-preview` by default,
|
||||
as it is by far the strongest coder of the GPT-4 models.
|
||||
|
||||
|
||||
|
||||
|
website/_posts/2024-05-02-browser.md (new file, 52 lines)

---
|
||||
title: Aider in your browser
|
||||
excerpt: Aider has an experimental browser UI, allowing you to collaborate with LLMs on code in your local git repo.
|
||||
highlight_image: /assets/browser.jpg
|
||||
nav_order: 800
|
||||
---
|
||||
# Aider in your browser
|
||||
|
||||
<div class="video-container">
|
||||
<video controls loop poster="/assets/browser.jpg">
|
||||
<source src="/assets/aider-browser-social.mp4" type="video/mp4">
|
||||
<a href="/assets/aider-browser-social.mp4">Aider browser UI demo video</a>
|
||||
</video>
|
||||
</div>
|
||||
|
||||
<style>
|
||||
.video-container {
|
||||
position: relative;
|
||||
padding-bottom: 101.89%; /* 1080 / 1060 = 1.0189 */
|
||||
height: 0;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
.video-container video {
|
||||
position: absolute;
|
||||
top: 0;
|
||||
left: 0;
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
}
|
||||
</style>
|
||||
|
||||
Use aider's new experimental browser UI to collaborate with LLMs
|
||||
to edit code in your local git repo.
|
||||
Aider will directly edit the code in your local source files,
|
||||
and [git commit the changes](https://aider.chat/docs/faq.html#how-does-aider-use-git)
|
||||
with sensible commit messages.
|
||||
You can start a new project or work with an existing git repo.
|
||||
Aider works well with GPT 3.5, GPT-4, GPT-4 Turbo with Vision,
|
||||
and Claude 3 Opus.
|
||||
It also supports [connecting to almost any LLM](https://aider.chat/docs/llms.html).
|
||||
|
||||
Use the `--browser` switch to launch the browser version of aider:
|
||||
|
||||
```
|
||||
pip install aider-chat
|
||||
|
||||
export OPENAI_API_KEY=<key> # Mac/Linux
|
||||
setx OPENAI_API_KEY <key> # Windows
|
||||
|
||||
aider --browser
|
||||
```
|
website/_posts/2024-05-13-models-over-time.md (new file, 324 lines)

---
|
||||
title: Drawing graphs with aider, GPT-4o and matplotlib
|
||||
excerpt: Use GPT-4o to draw graphs with matplotlib, including adjusting styles and making visual changes. You get the graph, but you also get the code in your repo.
|
||||
highlight_image: /assets/models-over-time.png
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||

|
||||
|
||||
# Drawing graphs with aider, GPT-4o and matplotlib
|
||||
|
||||
I recently wanted to draw a graph showing how LLM code editing skill has been
|
||||
changing over time as new models have been released by OpenAI, Anthropic and others.
|
||||
I have all the
|
||||
[data in a yaml file](https://github.com/paul-gauthier/aider/blob/main/_data/edit_leaderboard.yml) that is used to render
|
||||
[aider's LLM leaderboards](https://aider.chat/docs/leaderboards/).
|
||||
|
||||
Below is the aider chat transcript, which shows:
|
||||
|
||||
- I launch aider with the yaml file, a file with other plots I've done recently (so GPT can crib the style) and an empty file called `over_time.py`.
|
||||
- Then I ask GPT to draw the scatterplot I want.
|
||||
- I run the resulting script and share the error output with GPT so it can fix a small bug.
|
||||
- I ask it to color the points for GPT-4 and GPT-3.5 family models differently, to better see trends within those model families.
|
||||
- I add a PNG of the graph to the chat, so GPT can see the current graph and understand why I want it to move the point labels so they don't overlap the points.
|
||||
- I work through a series of other small style changes, like changing fonts and the graph border.
|
||||
|
||||
In the end I have the graph, but I also have the python code in my repo.
|
||||
So I can update this graph easily whenever I add new entries to the yaml data file.
|
||||
|
||||
|
||||
## Aider chat transcript
|
||||
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
> Aider v0.35.1-dev
|
||||
> Models: gpt-4o with diff edit format, weak model gpt-3.5-turbo
|
||||
> Added _data/edit_leaderboard.yml to the chat.
|
||||
> Added benchmark/over_time.py to the chat.
|
||||
> Added benchmark/plots.py to the chat.
|
||||
|
||||
#### in over_time.py, draw a graph for each of the yaml entries with a `released` date. plot date on x, `pass_rate_2` on y and label with the `model` name. do a scatter plot. use colors/styles similar to the other graphs.
|
||||
|
||||
To draw a scatter plot for each of the YAML entries with a `released` date, we need to:
|
||||
|
||||
1. Read the YAML file.
|
||||
2. Extract the `released` date and `pass_rate_2` for each entry.
|
||||
3. Plot the data using matplotlib.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
=======
|
||||
import matplotlib.pyplot as plt
|
||||
import yaml
|
||||
from datetime import datetime
|
||||
|
||||
def plot_over_time(yaml_file):
|
||||
with open(yaml_file, 'r') as file:
|
||||
data = yaml.safe_load(file)
|
||||
|
||||
dates = []
|
||||
pass_rates = []
|
||||
models = []
|
||||
|
||||
for entry in data:
|
||||
if 'released' in entry and 'pass_rate_2' in entry:
|
||||
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
pass_rates.append(entry['pass_rate_2'])
|
||||
models.append(entry['model'])
|
||||
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
|
||||
|
||||
for i, model in enumerate(models):
|
||||
plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
|
||||
plt.xlabel('Release Date')
|
||||
plt.ylabel('Pass Rate 2')
|
||||
plt.title('Model Performance Over Time')
|
||||
plt.grid(True)
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
|
||||
# Example usage
|
||||
plot_over_time('_data/edit_leaderboard.yml')
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 49ac9ca aider: Added scatter plot functionality for YAML entries with a `released` date and `pass_rate_2`.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### /run python benchmark/over_time.py
|
||||
> Traceback (most recent call last):
|
||||
File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
|
||||
plot_over_time('_data/edit_leaderboard.yml')
|
||||
File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
|
||||
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
TypeError: strptime() argument 1 must be str, not datetime.date
|
||||
> Add the output to the chat? y
|
||||
> Traceback (most recent call last):
|
||||
> File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
|
||||
> plot_over_time('_data/edit_leaderboard.yml')
|
||||
> File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
|
||||
> dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
> TypeError: strptime() argument 1 must be str, not datetime.date
|
||||
|
||||
The error indicates that the `released` field in the YAML file is already a `datetime.date` object, so we don't need to parse it with `datetime.strptime`. We can directly use it in the plot.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
|
||||
=======
|
||||
dates.append(entry['released'])
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit af2f4b7 aider: Use the `released` field directly in the plot instead of parsing it with `datetime.strptime`.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### color the "gpt-4" models differently
|
||||
|
||||
To color the "gpt-4" models differently, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-4" in the model name and assign a different color.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
|
||||
=======
|
||||
colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
|
||||
plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
This change will color the points red if the model name contains "gpt-4", otherwise, it will color them blue.
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 0d11357 aider: Color the "gpt-4" models differently.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### color the gpt-3.5 models a new color
|
||||
|
||||
To color the "gpt-3.5" models a new color, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-3.5" in the model name and assign a different color.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
|
||||
=======
|
||||
colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 85d9d84 aider: Color the "gpt-3.5" models green in the plotting code.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### use the colorscheme and fonts from the other graphs in plots.py
|
||||
|
||||
To use the colorscheme and fonts from the graphs in `plots.py`, we need to:
|
||||
|
||||
1. Import the necessary settings from `plots.py`.
|
||||
2. Apply these settings to the graph in `over_time.py`.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
import yaml
|
||||
from datetime import datetime
|
||||
=======
|
||||
import yaml
|
||||
from datetime import datetime
|
||||
from matplotlib import rc
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.figure(figsize=(10, 6))
|
||||
=======
|
||||
plt.rcParams["hatch.linewidth"] = 0.5
|
||||
plt.rcParams["hatch.color"] = "#444444"
|
||||
|
||||
rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
|
||||
|
||||
fig, ax = plt.subplots(figsize=(6, 4))
|
||||
ax.grid(axis="y", zorder=0, lw=0.2)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
=======
|
||||
ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
|
||||
plt.xlabel('Release Date')
|
||||
plt.ylabel('Pass Rate 2')
|
||||
plt.title('Model Performance Over Time')
|
||||
plt.grid(True)
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
=======
|
||||
ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
|
||||
ax.set_xlabel('Release Date')
|
||||
ax.set_ylabel('Pass Rate 2')
|
||||
ax.set_title('Model Performance Over Time')
|
||||
plt.tight_layout()
|
||||
plt.savefig("tmp_over_time.svg")
|
||||
imgcat(fig)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 8d670f1 aider: Used the colorscheme and fonts from the graphs in `plots.py` in `over_time.py`.
|
||||
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### /add tmp_over_time.png
|
||||
> Added /Users/gauthier/Projects/aider/tmp_over_time.png to the chat
|
||||
|
||||
#### make the labels not overlap the dots
|
||||
|
||||
To make the labels not overlap the dots, we can adjust the position of the labels slightly using the `xytext` parameter in the `annotate` function.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)
|
||||
=======
|
||||
ax.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75,
|
||||
xytext=(5, 5), textcoords='offset points')
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Committing benchmark/over_time.py before applying edits.
|
||||
> Commit 49eb72c Updated image file format from SVG to PNG and added imgcat library for displaying the plot.
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 75fa061 aider: Adjust the position of labels to prevent overlap with dots.
|
||||
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### make the dots bigger
|
||||
|
||||
To make the dots bigger, we can adjust the `s` parameter in the `scatter` function, which controls the size of the dots.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
ax.scatter(dates, pass_rates, c=colors, alpha=0.5)
|
||||
=======
|
||||
ax.scatter(dates, pass_rates, c=colors, alpha=0.5, s=100)
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit a018fdf aider: Make the dots bigger.
|
||||
|
||||
</div>
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### make the border of the graph thinner and lighter
|
||||
|
||||
To make the border of the graph thinner and lighter, we can adjust the `spines` properties of the axes.
|
||||
|
||||
Here are the changes:
|
||||
|
||||
benchmark/over_time.py
|
||||
```python
|
||||
<<<<<<< SEARCH
|
||||
colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
|
||||
=======
|
||||
for spine in ax.spines.values():
|
||||
spine.set_edgecolor('#DDDDDD')
|
||||
spine.set_linewidth(0.5)
|
||||
|
||||
colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
|
||||
>>>>>>> REPLACE
|
||||
```
|
||||
|
||||
> Applied edit to benchmark/over_time.py
|
||||
> Commit 77ba518 aider: Made the border of the graph thinner and lighter by adjusting the `spines` properties of the axes.
|
||||
|
||||
</div>
|
website/_posts/2024-05-22-draft.md (new file, 11 lines)

---
|
||||
title: A draft post.
|
||||
excerpt: With a draft summary.
|
||||
highlight_image: /assets/linting.jpg
|
||||
draft: true
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||
# A draft post
|
||||
|
||||
Content TBD.
|
website/_posts/2024-05-22-linting.md (new file, 146 lines)

---
|
||||
title: Linting code for LLMs with tree-sitter
|
||||
excerpt: Aider now lints code after every LLM edit and automatically fixes errors, using tree-sitter and AST-aware code context.
|
||||
highlight_image: /assets/linting.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||
[Linting](https://aider.chat/assets/linting.jpg)
|
||||
|
||||
# Linting code for LLMs with tree-sitter
|
||||
|
||||
Aider now lints your code after every LLM edit, and offers to automatically fix
|
||||
any linting errors.
|
||||
You can also use aider's lint-and-fix functionality on your source files any time
|
||||
you like, to speedily resolve issues with code written by humans.
|
||||
|
||||
Aider shows linting errors to the LLM in a novel format,
|
||||
using tree-sitter
|
||||
to help display relevant code context for each
|
||||
error.
|
||||
This increases the ability of the LLM to understand the problem and
|
||||
make the correct changes to resolve it.
|
||||
|
||||
Aider ships with basic linters built with tree-sitter that support
|
||||
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
|
||||
These built-in linters will detect syntax errors and other fatal problems with the code.
|
||||
|
||||
You can also configure aider to use your preferred linters.
|
||||
This allows aider to check for a larger class of problems, keep the code style
|
||||
aligned with the rest of your team, etc.
|
||||
|
||||
## Linting and fixing your code
|
||||
|
||||
Aider now lints each source file after it applies the edits
|
||||
suggested by an LLM.
|
||||
If problems are found, aider will ask if you'd like it to
|
||||
attempt to fix the errors.
|
||||
If so, aider will send the LLM a report of the lint errors
|
||||
and request changes to fix them. This process may iterate a few times
|
||||
as the LLM works to fully resolve all the issues.
|
||||
|
||||
You can also lint and fix files any time, on demand from within the aider chat or via the
|
||||
command line:
|
||||
|
||||
- By default, the in-chat `/lint` command will lint and fix all the files which have
been added to the chat. Or you can name any files
in your git repo as arguments.
|
||||
- From the command line, you can run `aider --lint` to lint and fix
|
||||
all the dirty files in the repo.
|
||||
Or you can specify specific filenames on the command line.
|
||||
|
||||
|
||||
## An LLM-friendly lint report
|
||||
|
||||
Most linting tools produce terse and cryptic output,
|
||||
which is one reason many engineers appreciate IDEs that highlight
|
||||
linting errors.
|
||||
LLMs don't have the luxury of using an IDE, so aider sends
the linting errors in an LLM-friendly format.
|
||||
|
||||
Here's an example of raw output of the `flake8` python linter:
|
||||
|
||||
```
|
||||
app.py:23:36: F821 undefined name 'num'
|
||||
app.py:41:16: F541 f-string is missing placeholders
|
||||
```
|
||||
|
||||
This sort of output relies on the user to use the line numbers to find and fix
each reported error.
|
||||
LLMs are quite bad at working with source code line numbers, often
|
||||
making off-by-one errors and other mistakes even when provided with
|
||||
a fully numbered code listing.
|
||||
|
||||
Aider augments the raw linter output by
displaying and
highlighting the lines that have errors within their
containing functions, methods, and classes.
|
||||
To do this, aider uses tree-sitter to obtain the code's AST and analyzes it
|
||||
in light of the linting errors.
|
||||
LLMs are more effective at editing code that's provided
|
||||
with context like this.
|
||||
|
||||
```
|
||||
app.py:23:36: F821 undefined name 'num'
|
||||
app.py:41:16: F541 f-string is missing placeholders
|
||||
|
||||
app.py:
|
||||
...⋮...
|
||||
6│class LongNum:
|
||||
7│ def __init__(self, num):
|
||||
8│ """
|
||||
9│ Initialize the number.
|
||||
10│ """
|
||||
...⋮...
|
||||
19│ def __str__(self):
|
||||
20│ """
|
||||
21│ Render the number as a string.
|
||||
22│ """
|
||||
23█ return str(num)
|
||||
24│
|
||||
25│
|
||||
26│@app.route('/subtract/<int:x>/<int:y>')
|
||||
...⋮...
|
||||
38│@app.route('/divide/<int:x>/<int:y>')
|
||||
39│def divide(x, y):
|
||||
40│ if y == 0:
|
||||
41█ return f"Error: Cannot divide by zero"
|
||||
42│ else:
|
||||
43│ result = x / y
|
||||
44│ return str(result)
|
||||
45│
|
||||
...⋮...
|
||||
```
|
||||
|
||||
## Basic linters for most popular languages
|
||||
|
||||
Aider comes batteries-included with built-in linters for
|
||||
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
|
||||
This provides wide support for linting without requiring
|
||||
users to manually install a linter and configure it to work with aider.
|
||||
|
||||
Aider's built-in language-agnostic linter uses tree-sitter to parse
|
||||
the AST of each file.
|
||||
When tree-sitter encounters a syntax error or other fatal issue
|
||||
parsing a source file, it inserts an AST node with type `ERROR`.
|
||||
Aider simply uses these `ERROR` nodes to identify all the lines
|
||||
with syntax or other types of fatal error, and displays
|
||||
them in the LLM-friendly format described above.
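For illustration, here is a minimal sketch of this idea. It assumes the `tree_sitter_languages` helper package for loading a parser, and it is not aider's actual linter code:

```python
# Hedged sketch: find the lines covered by tree-sitter ERROR nodes in a file.
from tree_sitter_languages import get_parser

def syntax_error_lines(source: str, language: str = "python") -> list[int]:
    """Return the 1-based line numbers where tree-sitter reports ERROR nodes."""
    parser = get_parser(language)
    tree = parser.parse(source.encode("utf-8"))

    lines = set()

    def walk(node):
        if node.type == "ERROR":
            # start_point is a 0-based (row, column) tuple.
            lines.add(node.start_point[0] + 1)
        for child in node.children:
            walk(child)

    walk(tree.root_node)
    return sorted(lines)

print(syntax_error_lines("def broken(:\n    pass\n"))
```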
|
||||
|
||||
## Configuring your preferred linters
|
||||
|
||||
You can optionally configure aider to use
|
||||
your preferred linters with the `--lint-cmd` switch.
|
||||
|
||||
```
|
||||
# To lint javascript with jslint
|
||||
aider --lint-cmd javascript:jslint
|
||||
|
||||
# To lint python with flake8 using some specific args:
|
||||
aider --lint-cmd "python:flake8 --select=E9,F821,F823..."
|
||||
```
|
||||
|
||||
You can provide multiple `--lint-cmd` switches
|
||||
to set linters for various languages.
|
||||
You can also durably set linters in your `.aider.conf.yml` file.
|
||||
|
website/_posts/2024-05-22-swe-bench-lite.md (new file, 451 lines)

---
|
||||
title: How aider scored SOTA 26.3% on SWE Bench Lite
|
||||
excerpt: Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
|
||||
highlight_image: /assets/swe_bench_lite.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||
# How aider scored SOTA 26.3% on SWE Bench Lite
|
||||
|
||||
[Aider scored 26.3%](https://github.com/swe-bench/experiments/pull/7)
|
||||
on the
|
||||
[SWE Bench Lite benchmark](https://www.swebench.com),
|
||||
achieving a state-of-the-art result.
|
||||
The previous top leaderboard entry was 20.3%
|
||||
from Amazon Q Developer Agent.
|
||||
|
||||
See also [aider's SOTA result on the main SWE Bench](https://aider.chat/2024/06/02/main-swe-bench.html).
|
||||
|
||||
[SWE Bench Lite results chart](https://aider.chat/assets/swe_bench_lite.svg)
|
||||
|
||||
**All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.**
|
||||
All results in the above chart are unhinted pass@1 results.
|
||||
Please see the [references](#references)
|
||||
for details on the data presented in this chart.
|
||||
It was corrected on 5/30/24 to reflect apples-to-apples comparisons,
|
||||
using pass@1 results from AutoCodeRover
|
||||
and results from OpenDevin that don't use hints.
|
||||
The [official SWE Bench Lite leaderboard](https://www.swebench.com)
|
||||
only accepts pass@1 results that do not use hints.
|
||||
|
||||
## Interactive, not agentic
|
||||
|
||||
Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming.
|
||||
Aider intentionally has quite limited and narrow "agentic behavior"
|
||||
to avoid long delays, high token costs
|
||||
and the need for users to repeatedly code review incorrect solutions.
|
||||
It's also worth noting that aider currently does not use
|
||||
RAG, vector search, tools or give the LLM access to search the web
|
||||
or unilaterally execute code.
|
||||
|
||||
Aider is first and foremost an interactive tool for engineers to get real work done in
|
||||
real code bases using a chat interface.
|
||||
Aider provides a pair programming UX where users can ask for a change
|
||||
and see the edits performed in real-time.
|
||||
Aider can also offer additional help like fixing lint or test errors,
|
||||
but the user is always in full interactive control.
|
||||
This lets them quickly steer misunderstandings back on course and
|
||||
avoid wasting time and token costs.
|
||||
|
||||
|
||||
## Benchmark methodology
|
||||
|
||||
For the benchmark,
|
||||
aider was launched in each problem's git repository
|
||||
with the problem statement
|
||||
submitted as the opening chat message from "the user."
|
||||
After that aider runs as normal, with the following modifications:
|
||||
|
||||
- Aider's suggestions were always accepted without user approval.
|
||||
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
|
||||
Plausibly correct means that aider reported that it had successfully edited the repo
|
||||
without causing syntax errors or breaking any *pre-existing* tests.
|
||||
- If the solution isn't plausible, the harness launches aider to try again from scratch,
|
||||
alternating between using aider with GPT-4o and Opus.
|
||||
- If no plausible solution is found after six tries, the harness picks the solution
with the fewest edit/lint/test problems, as sketched in the code below.
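Here is a rough sketch of that retry loop. The `run_aider` helper and the result fields are hypothetical stand-ins, not the real harness code from the aider-swe-bench repo:

```python
# Hedged sketch of the retry harness: alternate models, stop at the first
# plausible solution, otherwise fall back to the least-broken attempt.
MODELS = ["gpt-4o", "claude-3-opus-20240229"]

def solve_instance(run_aider, instance, max_attempts=6):
    attempts = []
    for i in range(max_attempts):
        result = run_aider(model=MODELS[i % 2], instance=instance)
        attempts.append(result)
        if result.edits_ok and result.lint_ok and result.tests_ok:
            return result.diff  # plausible solution, accept it immediately

    # No plausible solution: approximate "fewest problems" by counting failures
    # (the exact ordering used is described under "Finding a plausible solution").
    def problem_count(r):
        return (not r.edits_ok) + (not r.lint_ok) + (not r.tests_ok)

    return min(attempts, key=problem_count).diff
```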
|
||||
|
||||
It's important to be clear that
|
||||
*aider and the benchmark harness
|
||||
only had access to the pre-existing tests in each problem's repo*.
|
||||
The held out "acceptance tests" were *only* used
|
||||
after benchmarking to compute statistics on which problems aider
|
||||
correctly resolved.
|
||||
|
||||
The [full harness to run aider on SWE Bench Lite is available on GitHub](https://github.com/paul-gauthier/aider-swe-bench).
|
||||
|
||||
The benchmarking process was similar to how a developer might use aider to
|
||||
resolve a GitHub issue:
|
||||
|
||||
- They could launch aider in their repo with the command below, which
|
||||
tells aider they want to accept every suggestion
|
||||
and to use pytest to run tests.
|
||||
- `aider --yes --test-cmd pytest`
|
||||
- They could start the chat by pasting in the URL or text of a GitHub issue.
|
||||
Aider will pull in the URL's content and then try and solve the issue.
|
||||
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
|
||||
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
|
||||
so it's always easy to revert AI changes that don't pan out.
|
||||
|
||||
Outside a benchmark setting, it's probably
|
||||
unwise or at least highly inefficient
|
||||
to let *any* AI agent run unsupervised on your code base.
|
||||
The reason aider is intended to be used interactively
|
||||
is so that the user can participate and direct aider's work and approve suggestions.
|
||||
This way the user can offer immediate feedback or corrections if their initial
|
||||
instructions turn out to be ambiguous,
|
||||
or if the AI starts going down a wrong path.
|
||||
|
||||
## Aider with GPT-4o alone was SOTA
|
||||
|
||||
Running the benchmark harness
|
||||
only using aider with GPT-4o to find plausible solutions
|
||||
achieved a score of 25.0%.
|
||||
This result was itself state-of-the-art, before being surpassed by the main
result being reported here,
which used aider with both GPT-4o & Opus.
|
||||
|
||||
As noted below, a single attempt using Aider with GPT-4o tied
|
||||
the current top entry on the leaderboard.
|
||||
|
||||
## Aider with GPT-4o & Opus
|
||||
|
||||
The benchmark harness alternated between running aider with GPT-4o and Opus.
|
||||
The harness proceeded in a fixed order, always starting with GPT-4o and
|
||||
then alternating with Opus until a plausible solution was found for each
|
||||
problem.
|
||||
|
||||
The table below breaks down the plausible solutions that
|
||||
were found for the 300 problems.
|
||||
It also provides details on the 79 that were ultimately
|
||||
verified as correctly resolving their issue.
|
||||
Some noteworthy observations:
|
||||
|
||||
- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
|
||||
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark.
|
||||
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
|
||||
- A long tail of solutions continued to be found using both models, including one problem that was correctly resolved on its final, sixth attempt.
|
||||
|
||||
|
||||
| Attempt | Agent |Number of<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number of<br/>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite |
|
||||
|:--------:|------------|---------:|---------:|----:|---:|--:|
|
||||
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
|
||||
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
|
||||
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
|
||||
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
|
||||
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
|
||||
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
|
||||
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
|
||||
|
||||
|
||||
If we break down the solutions solely by model,
|
||||
we can see that aider with GPT-4o outperforms Opus.
|
||||
This isn't a fair and direct comparison, because GPT-4o always took the first
|
||||
turn and therefore got first crack at all the "easiest" problems.
|
||||
Aider with Opus only ever saw problems that GPT-4o failed to
|
||||
find plausible solutions for on its first try.
|
||||
|
||||
Aider with GPT-4o was producing higher quality plausible solutions,
|
||||
with a greater chance of going on to be accepted as resolving the issue.
|
||||
Again, this is biased by the turn ordering.
|
||||
But other anecdotal evidence from earlier runs of the benchmark
|
||||
also supports the observation that aider with GPT-4o is significantly stronger than Opus
|
||||
for this benchmark.
|
||||
|
||||
|
||||
| Agent | Number of<br>plausible<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>plausible<br>which<br>correctly<br>resolved<br>|
|
||||
|------------|---------:|---------:|---:|
|
||||
| Aider with GPT-4o | 239 | 66 |27.6% |
|
||||
| Aider with Opus | 61 | 13 |21.3% |
|
||||
| **Total** | **300** | **79** |**26.3%** |
|
||||
|
||||
## Repository map, not RAG
|
||||
|
||||
The crucial first step in solving a SWE Bench problem is figuring out
|
||||
which parts of the repo are relevant and which files need to be edited.
|
||||
Most coding agents use some combination of RAG, vector search
|
||||
and providing the LLM with
|
||||
tools to interactively explore the code base.
|
||||
|
||||
Aider instead uses a
|
||||
[repository map](https://aider.chat/2023/10/22/repomap.html)
|
||||
to help the LLM understand the
|
||||
layout, code structure, and content of a git repo.
|
||||
The repo map is created through static analysis of the code's
|
||||
abstract syntax tree and call graph
|
||||
to provide a compact and powerful summary of the entire code base.
|
||||
The map is constantly
|
||||
tailored to show
|
||||
repo context that is relevant to the current state of the chat conversation.
|
||||
This is done by performing a graph optimization on the code's call graph.
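As a purely illustrative sketch of what such a graph optimization could look like, the snippet below ranks files with PageRank over a made-up definition/reference graph, biased toward the files already in the chat. It uses `networkx` and is not aider's actual repo map code:

```python
# Hedged sketch: rank repo files with a PageRank-style graph optimization.
import networkx as nx

def rank_files(reference_edges, chat_files):
    """reference_edges: (referencing_file, defining_file) pairs from static analysis."""
    graph = nx.DiGraph()
    for src, dst in reference_edges:
        weight = graph[src][dst]["weight"] + 1 if graph.has_edge(src, dst) else 1
        graph.add_edge(src, dst, weight=weight)

    # Bias the ranking toward files already in the chat, so the map highlights
    # the parts of the repo most relevant to the current conversation.
    chat = set(chat_files) & set(graph)
    personalization = {n: (1.0 if n in chat else 0.01) for n in graph} if chat else None

    ranked = nx.pagerank(graph, weight="weight", personalization=personalization)
    return sorted(ranked, key=ranked.get, reverse=True)

# Toy example: app.py references helpers in db.py and auth.py.
print(rank_files([("app.py", "db.py"), ("app.py", "auth.py"), ("auth.py", "db.py")],
                 chat_files=["app.py"]))
```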
|
||||
|
||||
When the user asks for a change to their code, the LLM can use the repo map
|
||||
to decide which files to edit.
|
||||
The LLM simply returns a normal text response explaining which files
|
||||
it needs to edit and why.
|
||||
Aider notices when the LLM mentions filenames from the repo,
|
||||
and asks the user if they should be added to the chat.
|
||||
Adding a file to the chat allows the LLM to see the full contents
|
||||
of the file and edit it.
|
||||
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
#### Please add a new /factorial/N endpoint.
|
||||
|
||||
To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
|
||||
Please add app.py to the chat so I can proceed with the changes.
|
||||
|
||||
> app.py
|
||||
> Add these files to the chat? yes
|
||||
|
||||
</div>
|
||||
|
||||
This is a convenient and natural workflow for interactive chat,
|
||||
and it worked well for the SWE Bench problems.
|
||||
Aider successfully identified the correct file to edit
|
||||
in 70.3% of the benchmark tasks.
|
||||
|
||||
We can determine which file needs to be edited using the "gold" patch
|
||||
which is associated with each SWE Bench task.
|
||||
This patch was created by a human developer
|
||||
to solve the issue, and therefore reveals a file which can
|
||||
be edited to solve the problem.
|
||||
Of course aider is not able to see or use the gold patch
|
||||
or the file names it contains in any way.
|
||||
This information was only used to compute
|
||||
statistics outside the benchmarking process.
|
||||
|
||||
|
||||
## Reliable code editing
|
||||
|
||||
Once files have been selected for editing,
|
||||
the next step is of course to edit the source code to fix the problem.
|
||||
|
||||
Aider goes to great lengths to ensure that LLMs can not just write code,
|
||||
but reliably *edit* code.
|
||||
Aider has a collection of prompting strategies and code editing backends which have
|
||||
been honed through
|
||||
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
|
||||
These foundational capabilities help ensure that aider can
|
||||
properly integrate code from LLMs into an existing code base and source files.
|
||||
|
||||
The repository map helps here too, making sure that the LLM
|
||||
can see relevant classes, functions and variables from the entire repo.
|
||||
This helps ensure that the project's existing APIs and conventions are
|
||||
respected and utilized when new code is added.
|
||||
|
||||
Regardless, there are still cases where aider may be unable to cleanly
|
||||
complete the edits specified by the LLM.
|
||||
This is usually because the LLM has failed to conform to the editing
|
||||
instructions in its system prompt.
|
||||
When aider completes, it returns an editing outcome that indicates
|
||||
whether it was able to successfully apply all edits.
|
||||
The benchmark harness uses this editing status as
|
||||
one criterion to determine if aider has
|
||||
created a plausible solution.
|
||||
|
||||
## Linting and fixing
|
||||
|
||||
Another key criterion for a plausible solution is that it passes basic
|
||||
linting, which means that the code has no syntax
|
||||
or other fatal errors.
|
||||
[Aider lints code](https://aider.chat/2024/05/22/linting.html)
|
||||
after every LLM edit and offers to automatically fix
|
||||
any problems.
|
||||
|
||||
Aider ships with built-in linters based on tree-sitter
|
||||
which work with most popular programming languages.
|
||||
Aider shows linting errors to the LLM in a novel format,
|
||||
using the abstract syntax tree to display relevant code context for each
|
||||
error.
|
||||
This context helps LLMs understand the problem and
|
||||
make the correct changes to resolve it.
|
||||
|
||||
<div class="chat-transcript" markdown="1">
|
||||
|
||||
```
|
||||
app.py:23:36: F821 undefined name 'num'
|
||||
|
||||
app.py:
|
||||
...⋮...
|
||||
6│class LongNum:
|
||||
...⋮...
|
||||
19│ def expound(self, threshold):
|
||||
20│ number = self.basis
|
||||
21│ while number < threshold:
|
||||
22│ number *= self.factor
|
||||
23█ return num
|
||||
24│
|
||||
25│
|
||||
...⋮...
|
||||
```
|
||||
|
||||
> Attempt to fix lint errors? yes
|
||||
|
||||
</div>
|
||||
|
||||
In the benchmark, these linting suggestions are always accepted.
|
||||
At completion,
|
||||
aider reports a linting outcome that
|
||||
indicates if it was able to produce
|
||||
code without any outstanding linting errors.
|
||||
The benchmark harness uses this status as
|
||||
one of the criteria to determine if aider has
|
||||
created a plausible solution.
|
||||
|
||||
## Testing and fixing
|
||||
|
||||
The final criterion for a plausible solution is that
|
||||
all tests must be passing.
|
||||
Aider can be configured with the command to run tests for a repo,
|
||||
and will automatically attempt to fix any test failures.
|
||||
|
||||
A user working on a python project might configure testing
|
||||
by launching aider like this:
|
||||
|
||||
```
|
||||
aider --test-cmd pytest
|
||||
```
|
||||
|
||||
For the benchmark, aider is configured with a test command that will run the
|
||||
tests that already exist in each problem's repository.
|
||||
SWE Bench problems are based on repositories from large open
|
||||
source projects with extensive existing test suites.
|
||||
This means that
|
||||
testing will fail if aider has broken any of these
|
||||
pre-existing tests or if any new
|
||||
tests that it created aren't passing.
|
||||
|
||||
As with editing and linting, aider reports a testing outcome
|
||||
that indicates if it completed with any outstanding failing tests.
|
||||
The benchmark harness uses this status when deciding if aider
|
||||
has produced a plausible solution.
|
||||
|
||||
To be clear, *aider cannot run or even see the held out "acceptance tests"* that
|
||||
are used to judge if a proposed solution correctly
|
||||
resolves the problem.
|
||||
Those tests are only run outside of aider and the benchmark harness,
|
||||
to compute the final benchmark statistics.
|
||||
|
||||
## Finding a plausible solution
|
||||
|
||||
Each time aider executes, it reports
|
||||
the outcome of the editing, linting, and testing
|
||||
steps.
|
||||
Each of these steps may complete successfully or
|
||||
return a status that indicates that there were outstanding
|
||||
problems that remain unresolved.
|
||||
|
||||
The benchmark harness uses these outcomes to determine if
|
||||
aider has produced a plausible
|
||||
solution to the current SWE Bench task.
|
||||
A plausible solution is one where aider
|
||||
returns saying that it
|
||||
edited the repo with no outstanding
|
||||
edit, lint, or test errors.
|
||||
In this case, aider's changes are recorded
|
||||
as the SWE Bench `model_patch` to be evaluated later with the
|
||||
acceptance tests.
|
||||
|
||||
If the solution is not plausible, another
|
||||
instance of aider is launched again from scratch on the same problem.
|
||||
The harness alternates launching aider with GPT-4o and Opus to solve the problem,
|
||||
and gives each model three attempts -- for a total of six attempts.
|
||||
As soon as a plausible solution is found, it is accepted and the
|
||||
harness moves on to the next SWE Bench instance.
|
||||
|
||||
It's worth noting that repositories may have lint or test errors
|
||||
present before aider even starts to edit them.
|
||||
Whether unresolved errors were caused by aider or were pre-existing,
|
||||
there will be instances where
|
||||
no plausible solution is
|
||||
found after six tries.
|
||||
|
||||
If all six attempts fail to produce a plausible solution,
|
||||
then the "best" solution available is selected as the
|
||||
`model_patch`.
|
||||
Which of the non-plausible solutions to use is determined
|
||||
by ignoring the testing outcome
|
||||
and prioritizing solutions in the following order:
|
||||
|
||||
- Pick a solution where editing and linting were completed successfully.
|
||||
- Pick a solution where editing was at least partially successful and linting succeeded.
|
||||
- Pick a solution where editing was successful.
|
||||
- Pick a solution where editing was at least partially successful.
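For illustration, that ordering could be expressed as a small sort key like the one below. The attempt fields are hypothetical, matching the retry-loop sketch shown earlier:

```python
# Hedged sketch of the fallback ordering: ignore the testing outcome and
# prefer attempts whose edits and linting completed successfully.
def fallback_rank(attempt):
    if attempt.edits_ok and attempt.lint_ok:
        return 0  # clean edits, clean lint
    if attempt.edits_partial and attempt.lint_ok:
        return 1  # partial edits, clean lint
    if attempt.edits_ok:
        return 2  # clean edits, lint problems remain
    if attempt.edits_partial:
        return 3  # partial edits, lint problems remain
    return 4      # nothing better available

# best = min(attempts, key=fallback_rank)
```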
|
||||
|
||||
## Computing the benchmark score
|
||||
|
||||
The benchmark harness produced a candidate solution for each of the 300
SWE Bench Lite instances and saved it as the `model_patch`.
|
||||
|
||||
A separate evaluation script was used to
|
||||
test each of these solutions with the full test suite,
|
||||
including the held out acceptance tests.
|
||||
For this final acceptance testing, any edits that aider made to tests
|
||||
are discarded.
|
||||
This ensures that the correct,
|
||||
unmodified test suite is used for acceptance testing.
|
||||
The evaluation script compares the test results
|
||||
with results from testing
|
||||
the "gold" patch that was developed by a human to correctly solve the issue.
|
||||
If they match, the candidate solution has correctly resolved the issue.
|
||||
|
||||
These acceptance tests are only ever run outside of aider
|
||||
and the benchmark harness, and only to compute the number of
|
||||
correctly resolved instances.
|
||||
They are never run, used, or even visible during aider's attempts to solve the problems.
|
||||
|
||||
Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
Much thanks to the team behind the
|
||||
[SWE Bench](https://www.swebench.com)
|
||||
family of AI coding benchmarks.
|
||||
Also thanks to Albert Örwall who has
|
||||
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker)
|
||||
making it faster, easier, and more reliable to run the acceptance tests.
|
||||
|
||||
|
||||
## References
|
||||
|
||||
All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.
|
||||
|
||||
The "aider agent" internally makes multiple "attempts" at solving the problem,
|
||||
but it picks and returns one single candidate solution.
|
||||
Only that one candidate solution is evaluated with the acceptance tests
|
||||
and contributes to the benchmark score.
|
||||
Thus it is a pass@1 result.
|
||||
|
||||
This is in contrast to a pass@N result for N>1, where N attempts are made
|
||||
and all N solutions are evaluated by the acceptance tests.
|
||||
If *any* of the N solutions pass, that counts as a pass@N success.
|
||||
|
||||
Below are the references for the other pass@1 unhinted SWE-Bench results
|
||||
displayed in the graph at the beginning of this article.
|
||||
|
||||
- [20.3% Amazon Q Developer Agent (v20240430-dev)](https://www.swebench.com)
|
||||
- [19.0% AutoCodeRover](https://www.swebench.com/)
|
||||
- [18.0% SWE-Agent + GPT-4](https://www.swebench.com)
|
||||
- [16.7% OpenDevin](https://github.com/OpenDevin/OpenDevin/issues/2149)
|
||||
- [11.7% SWE-Agent + Opus](https://www.swebench.com)
|
||||
|
||||
Note, the graph was corrected on 5/30/24 as follows.
|
||||
|
||||
The graph now contains AutoCodeRover's average pass@1 results.
|
||||
Previously it displayed pass@3 results, which are
|
||||
not comparable
|
||||
to the pass@1 results for aider being reported here.
|
||||
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
|
||||
features pass@3 results
that are not clearly labeled as such.
|
||||
|
||||
The graph now contains the best OpenDevin results obtained without using
|
||||
the SWE Bench `hints_text` to provide hints to the agent.
|
||||
The previous graph contained their hinted result,
|
||||
which is not comparable
|
||||
to the unhinted aider results being reported here.
|
||||
[OpenDevin reported hinted results](https://x.com/gneubig/status/1791498953709752405)
|
||||
without noting that hints were used.
|
website/_posts/2024-05-24-self-assembly.md (new file, 67 lines)

---
|
||||
title: Aider has written 7% of its own code
|
||||
excerpt: Aider has written 7% of its own code, via 600+ commits that inserted 4.8K and deleted 1.5K lines of code.
|
||||
highlight_image: /assets/self-assembly.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||
# Aider has written 7% of its own code
|
||||
|
||||
[Self assembly](https://aider.chat/assets/self-assembly.jpg)
|
||||
|
||||
The
|
||||
[aider git repo](https://github.com/paul-gauthier/aider)
|
||||
currently contains about 4K commits and 14K lines of code.
|
||||
|
||||
Aider made 15% of the commits, inserting 4.8K and deleting 1.5K lines of code.
|
||||
|
||||
About 7% of the code now in the repo is attributable to an aider commit
|
||||
using `git blame`.
|
||||
This number is probably a significant undercount, because periodic reformatting
|
||||
by `black` is likely obscuring aider's authorship of many lines.
|
||||
|
||||
Here's the breakdown of the code aider wrote in the current code base
|
||||
according to `git blame`.
|
||||
|
||||
| File | Lines | Percent |
|
||||
|---|---:|---:|
|
||||
|aider/args.py| 6 of 449 | 1.3% |
|
||||
|aider/coders/base_coder.py| 37 of 1354 | 2.7% |
|
||||
|aider/coders/editblock_coder.py| 14 of 507 | 2.8% |
|
||||
|aider/coders/editblock_func_coder.py| 6 of 141 | 4.3% |
|
||||
|aider/coders/udiff_coder.py| 2 of 421 | 0.5% |
|
||||
|aider/coders/wholefile_coder.py| 5 of 146 | 3.4% |
|
||||
|aider/coders/wholefile_func_coder.py| 4 of 134 | 3.0% |
|
||||
|aider/commands.py| 67 of 703 | 9.5% |
|
||||
|aider/diffs.py| 15 of 129 | 11.6% |
|
||||
|aider/gui.py| 2 of 533 | 0.4% |
|
||||
|aider/history.py| 19 of 124 | 15.3% |
|
||||
|aider/io.py| 55 of 368 | 14.9% |
|
||||
|aider/linter.py| 30 of 240 | 12.5% |
|
||||
|aider/main.py| 30 of 466 | 6.4% |
|
||||
|aider/mdstream.py| 3 of 122 | 2.5% |
|
||||
|aider/models.py| 22 of 549 | 4.0% |
|
||||
|aider/repo.py| 19 of 266 | 7.1% |
|
||||
|aider/repomap.py| 17 of 518 | 3.3% |
|
||||
|aider/scrape.py| 12 of 199 | 6.0% |
|
||||
|aider/versioncheck.py| 10 of 37 | 27.0% |
|
||||
|aider/voice.py| 9 of 104 | 8.7% |
|
||||
|benchmark/benchmark.py| 33 of 730 | 4.5% |
|
||||
|benchmark/over_time.py| 32 of 60 | 53.3% |
|
||||
|benchmark/swe_bench_lite.py| 40 of 71 | 56.3% |
|
||||
|scripts/blame.py| 55 of 212 | 25.9% |
|
||||
|scripts/versionbump.py| 96 of 123 | 78.0% |
|
||||
|setup.py| 11 of 47 | 23.4% |
|
||||
|tests/test_coder.py| 48 of 612 | 7.8% |
|
||||
|tests/test_commands.py| 135 of 588 | 23.0% |
|
||||
|tests/test_editblock.py| 23 of 403 | 5.7% |
|
||||
|tests/test_io.py| 30 of 65 | 46.2% |
|
||||
|tests/test_main.py| 13 of 239 | 5.4% |
|
||||
|tests/test_models.py| 6 of 28 | 21.4% |
|
||||
|tests/test_repo.py| 2 of 296 | 0.7% |
|
||||
|tests/test_repomap.py| 70 of 217 | 32.3% |
|
||||
|tests/test_udiff.py| 7 of 119 | 5.9% |
|
||||
|tests/test_wholefile.py| 37 of 321 | 11.5% |
|
||||
| **Total** | **1022 of 14219** | **7.2%** |
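Numbers like these can be approximated with `git blame --line-porcelain`. Below is a rough sketch, assuming aider's commits are identified by the `aider:` prefix it puts on its commit summaries; the real `scripts/blame.py` may work differently:

```python
# Hedged sketch: count how many lines of a file git blame attributes to
# commits whose summary starts with "aider:".
import subprocess

def aider_share(path):
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout
    summaries = [line[len("summary "):]
                 for line in out.splitlines() if line.startswith("summary ")]
    aider_lines = sum(1 for s in summaries if s.startswith("aider:"))
    return aider_lines, len(summaries)

lines, total = aider_share("aider/commands.py")
print(f"{lines} of {total} lines ({100 * lines / total:.1f}%)")
```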
|
||||
|
||||
|
website/_posts/2024-06-02-main-swe-bench.md (new file, 264 lines)

---
|
||||
title: Aider is SOTA for both SWE Bench and SWE Bench Lite
|
||||
excerpt: Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.
|
||||
highlight_image: /assets/swe_bench.jpg
|
||||
nav_exclude: true
|
||||
---
|
||||
|
||||
# Aider is SOTA for both SWE Bench and SWE Bench Lite
|
||||
|
||||
Aider scored 18.9%
|
||||
on the main
|
||||
[SWE Bench benchmark](https://www.swebench.com),
|
||||
achieving a state-of-the-art result.
|
||||
The current top leaderboard entry is 13.8%
|
||||
from Amazon Q Developer Agent.
|
||||
The best result reported elsewhere seems to be
|
||||
[13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
|
||||
|
||||
This result on the main SWE Bench builds on
|
||||
[aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
|
||||
|
||||
[SWE Bench results chart](https://aider.chat/assets/swe_bench.svg)
|
||||
|
||||
**All of aider's results reported here are pass@1 results,
|
||||
obtained without using the SWE Bench `hints_text`.**
|
||||
Aider was benchmarked on the same
|
||||
[570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
|
||||
that were used in the
|
||||
[Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
|
||||
See the [references](#references)
|
||||
for more details on the data presented in this chart.
|
||||
|
||||
## Interactive, not agentic

Aider achieved this result mainly through its existing features that focus on static
code analysis, reliable LLM code editing, and pragmatic UX for automatically
fixing linting and testing errors.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.
It's also worth noting that aider currently does not use
RAG, vector search, or tools, nor does it give the LLM access to search the web
or unilaterally execute code.

Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming UX where users can ask for a change
and see code edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This allows users to quickly steer misunderstandings back on course and
avoid wasting time and token costs.

## Benchmark methodology

Benchmarking was conducted as follows:

- Aider with GPT-4o was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
- After that, aider ran as normal, except that all of aider's
suggestions were always accepted without user approval.
- A [simple harness](https://github.com/paul-gauthier/aider-swe-bench#the-aider-agent) was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any *pre-existing* tests.
- If the solution from aider with GPT-4o wasn't plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
- If no plausible solution was found after those two tries, the harness picked the "most plausible" solution with the fewest edit/lint/test problems (see the sketch below).

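A minimal sketch of that retry loop, assuming a hypothetical `run_aider()` helper that launches aider on the repo with the problem statement, auto-accepts every suggestion, and reports outstanding problem counts (this is not the actual harness code):

```
# Illustrative pseudo-harness only -- not the real aider-swe-bench code.
# run_aider() is a hypothetical stand-in: it runs aider on the problem's repo
# and returns a dict of outstanding problem counts, e.g.
# {"edit_errors": 0, "lint_errors": 0, "test_failures": 0}.
def solve(problem, run_aider):
    attempts = []
    for model in ("gpt-4o", "claude-3-opus"):
        result = run_aider(problem, model=model)
        attempts.append(result)
        # Plausible: clean edits, no lint errors, no broken pre-existing tests.
        if sum(result.values()) == 0:
            return result
    # No plausible solution after two tries: keep the "most plausible"
    # candidate, i.e. the one with the fewest outstanding problems.
    return min(attempts, key=lambda r: sum(r.values()))
```
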
It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
The held out "acceptance tests" were *only* used
after benchmarking to compute statistics on which problems aider
correctly resolved.

This is the same approach
that was used for
[aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
For the Lite benchmark,
aider alternated between GPT-4o and Opus for up to six total attempts.
To manage the cost of running the main SWE Bench benchmark,
aider was limited to two total attempts:
one with GPT-4o and one with Opus.

For a detailed discussion of the benchmark
methodology, see the
[article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
Also, the
[aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
contains the harness and statistics code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

- They could launch aider in their repo with the command below, which
tells aider they want to accept every suggestion
and to use pytest to run tests.
- `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL's content and then try and resolve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to
[use git to revert the changes](https://aider.chat/docs/faq.html#how-does-aider-use-git),
and try again with `aider --opus`.

## Aider with GPT-4o alone was SOTA

Using aider with GPT-4o to make a single attempt at resolving each problem
achieved a score of 17.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.

## Aider with GPT-4o & Opus

The benchmark harness started by using aider with GPT-4o to try
and resolve each problem.
For problems where this didn't produce a plausible solution,
the harness tried again using aider with Opus.
So at most two attempts were made for each problem.

The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.
A proposed solution is either:

- A plausible solution where
aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).

The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.

| Attempt | Agent |Number of<br>proposed<br>solutions|Percent of<br>proposed<br>solutions| Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |

## Non-plausible but correct solutions?

A solution doesn't actually have to be plausible in order to correctly resolve the issue.
Recall that plausible is simply defined as aider
reporting that it successfully completed all file edits,
resolved any linting errors
and resolved any test failures.
But there are many reasons why aider might fail to do those things
and yet still produce a solution that will pass
acceptance testing:

- There may have been pre-existing failing tests in the repo,
before aider even started working on the SWE Bench problem.
Aider may not have resolved such issues, and yet they may not be
relevant to the acceptance testing.
The SWE Bench acceptance testing just confirms that tests pass or fail
in the same pattern as the "gold patch" developed by a human to resolve the
problem.
Some tests may fail during acceptance testing,
and that's ok as long as they failed for the gold
patch too.
- There may have been pre-existing linting problems in the repo.
If lingering linting issues affect code paths that are not well tested,
they may not impact acceptance testing.
- Aider may have reported file editing errors because it thought the LLM
specified edits that it wasn't able to successfully apply.
This can only happen when the LLM specified edits in
a way that doesn't comply with the editing instructions in the system prompt.
Given that the LLM isn't complying with the system prompt,
it may have become confused and
asked for redundant or otherwise irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.
- Etc.

Keeping all this in mind, we can understand why
GPT-4o accounts for 15.3% of the benchmark score in the table above,
but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
When an Opus attempt is allowed after GPT-4o,
it may propose some *incorrect* solutions which
are "more plausible" than some of GPT-4o's non-plausible solutions.
These more plausible, incorrect solutions can
eclipse some of
the earlier non-plausible correct solutions that GPT-4o generated.
This is why GPT-4o's score in the table
showing the combined GPT-4o & Opus results (15.3%)
is lower than the result from just one try using aider with GPT-4o (17.0%).

For these reasons, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
New solutions may resolve some new problems but they may also
eclipse and discard some of the previous non-plausible correct solutions.

Luckily, the net effect of additional attempts
usually increases or at least maintains the
number of resolved problems.
This was the case for all the attempts made in both this main SWE Bench result and the
earlier Lite result.

## Computing the benchmark score

The benchmark harness produced one proposed solution for each of
the 570 SWE Bench problems.

A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each proposed solution's test results
with results from testing
the "gold" patch that was developed by a human to correctly resolve the issue.
If they matched, the proposed solution correctly resolved the issue.

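As a rough illustration of that comparison (this is not the actual SWE Bench evaluation code), "matching" can be thought of as requiring the proposed patch to reproduce the gold patch's pass/fail outcome for every test:

```
# Illustrative only -- not the SWE Bench evaluation scripts.
# Test results are modeled as a mapping of test name -> True (pass) / False (fail).
def resolves_issue(proposed, gold):
    # The proposed solution counts as resolving the issue if every test
    # finishes with the same status it had under the human-written gold patch.
    return all(proposed.get(test) == status for test, status in gold.items())
```
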
These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to resolve the problems.

Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
or 18.9%.

## Acknowledgments

Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall, who has
[dockerized the SWE Bench evaluation scripts](https://github.com/aorwall/SWE-bench-docker),
making it faster, easier, and more reliable to run the acceptance tests.

## References

All of aider's results reported here are pass@1 results,
obtained without using the SWE Bench `hints_text`.

The "aider agent" internally makes multiple "attempts" at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
Thus it is a pass@1 result.

This is in contrast to a pass@N result for N>1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If *any* of the N solutions passes, that counts as a pass@N success.

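To make the distinction concrete, here is a minimal sketch of the metric (acceptance-test outcomes modeled as booleans; this is an illustration, not benchmark code):

```
# pass@n: success if any of the first n candidate solutions passes acceptance testing.
def pass_at_n(candidates_pass, n):
    return any(candidates_pass[:n])

# Aider submits exactly one candidate per problem, so its score is pass@1.
```
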
Below are the references for the other pass@1 unhinted SWE-Bench results
displayed in the graph at the beginning of this article.

- [13.9% Devin, benchmarked on 570 instances.](https://www.cognition.ai/post/swe-bench-technical-report)
- [13.8% Amazon Q Developer Agent, benchmarked on 2,294 instances.](https://www.swebench.com)
- [12.5% SWE-agent + GPT-4, benchmarked on 2,294 instances.](https://www.swebench.com)
- [10.6% AutoCodeRover, benchmarked on 2,294 instances.](https://arxiv.org/pdf/2404.05427v2)
- [10.5% SWE-agent + Opus, benchmarked on 2,294 instances.](https://www.swebench.com)

The graph contains average pass@1 results for AutoCodeRover.
The [AutoCodeRover GitHub page](https://github.com/nus-apr/auto-code-rover)
features their pass@3 results
without clearly labeling them as such.
Table 2 of their
[paper](https://arxiv.org/pdf/2404.05427v2)
reports an `ACR-avg` result of 10.59%, which is an average pass@1 result.