This commit is contained in:
Paul Gauthier 2024-05-22 18:26:04 -07:00
parent 18e3f55c4e
commit 5e13399f46


@ -39,7 +39,7 @@ This lets them quickly steer misunderstandings back on course and
avoid wasted time, code reviews and token costs. avoid wasted time, code reviews and token costs.
## Methodology ## Benchmark methodology
For the benchmark, For the benchmark,
aider was launched in each problem's git repository aider was launched in each problem's git repository
@ -61,18 +61,18 @@ It's important to be clear that during benchmarking
It could not see or run the held out "acceptance tests" that are used later to see if the It could not see or run the held out "acceptance tests" that are used later to see if the
SWE Bench problem was correctly resolved. SWE Bench problem was correctly resolved.
The benchmarking process can be thought of as similar to a user: The benchmarking process was similar to a user employing aider like this:
- Launching aider in their repo with something like the command below, which - Launching aider in their repo with something like the command below, which
tells aider to say yes to every suggestion and use pytest to run tests. tells aider to say yes to every suggestion and use pytest to run tests.
- `aider --yes --test-cmd pytest` - `aider --yes --test-cmd pytest`
- Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like: - Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like:
- `/web https://github.com/django/django/issues/XXX` - `/web https://github.com/django/django/issues/XXX`
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe with a different LLM this time. - If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git), [Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
so it's always easy to revert AI changes that don't pan out. so it's always easy to revert AI changes that don't pan out.
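
The revert-and-retry step described in the list above can be sketched roughly as follows. This is only an illustration of the control flow, not aider's or the benchmark harness's actual code: `first_clean_attempt`, `run_attempt`, and `revert` are hypothetical names, and the caller is assumed to supply callables that run one attempt (returning `True` when the result lints and tests clean) and that undo the changes from a failed attempt.

```python
from typing import Callable, Iterable, Optional


def first_clean_attempt(
    models: Iterable[str],
    run_attempt: Callable[[str], bool],
    revert: Callable[[], None],
) -> Optional[str]:
    """Try each model in order; return the first whose attempt comes back clean.

    All names here are hypothetical. `run_attempt(model)` is assumed to
    return True when the edits lint and test clean; `revert()` is assumed
    to undo the edits from a failed attempt (for example via git).
    """
    for model in models:
        if run_attempt(model):
            return model
        # The attempt did not come back clean: undo it before trying again.
        revert()
    return None
```

Passing the attempt and revert steps in as callables keeps the sketch independent of any particular CLI invocation or git command.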
Of course, outside a benchmark setting, it's probably Outside a benchmark setting, it's probably
unwise to let *any* AI agent run unsupervised on your code base. unwise to let *any* AI agent run unsupervised on your code base.
Aider is intended to be used as an interactive pair-programming chat, Aider is intended to be used as an interactive pair-programming chat,
where the user participates to direct aider's work and approve suggestions. where the user participates to direct aider's work and approve suggestions.
@ -82,7 +82,7 @@ or if the AI starts going down a wrong path.
## Aider with GPT-4o alone was SOTA ## Aider with GPT-4o alone was SOTA
Running the entire SWE Bench Lite benchmark using aider with just GPT-4o Running the SWE Bench Lite benchmark using aider with just GPT-4o
achieved a score of 25%. achieved a score of 25%.
This was itself a state-of-the-art result, before being surpassed by the main This was itself a state-of-the-art result, before being surpassed by the main
result being reported here result being reported here
@ -203,7 +203,7 @@ respected when new code is added.
[Aider lints code](https://aider.chat/2024/05/22/linting.html) [Aider lints code](https://aider.chat/2024/05/22/linting.html)
after every LLM edit and offers to automatically fix after every LLM edit and offers to automatically fix
any linting errors. any linting errors.
Aider includes basic linters built with tree-sitter that supports Aider includes basic linters built with tree-sitter to check
[most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py). [most popular programming languages](https://github.com/paul-gauthier/grep-ast/blob/main/grep_ast/parsers.py).
These built-in linters will detect syntax errors and other fatal problems with the code. These built-in linters will detect syntax errors and other fatal problems with the code.
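
As a rough illustration of what "detecting fatal problems" means for a single language, the sketch below checks Python source using the standard library's `compile()` built-in. This is a stand-in technique, not the tree-sitter based linter the post describes, and `find_fatal_errors` is a hypothetical helper name. It only reports errors that stop the code from compiling; it would not flag issues such as an undefined name, which need a fuller linter.

```python
from typing import List


def find_fatal_errors(source: str, filename: str = "<string>") -> List[str]:
    """Return 'file:line:col: message' strings, or an empty list if the code compiles.

    Hypothetical helper: uses Python's built-in compile() as a stand-in
    for a real linter, so it only catches syntax-level errors.
    """
    try:
        # "exec" mode compiles the source as a whole module without running it.
        compile(source, filename, "exec")
    except SyntaxError as err:
        return [f"{filename}:{err.lineno}:{err.offset}: {err.msg}"]
    return []
```

A caller could feed any reported messages back to the LLM along with the offending lines, in the spirit of the transcript shown in the post.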
@ -220,36 +220,38 @@ make the correct changes to resolve it.
<div class="chat-transcript" markdown="1"> <div class="chat-transcript" markdown="1">
```
app.py:23:36: F821 undefined name 'num'
app.py:41:16: F541 f-string is missing placeholders

app.py:
...⋮...
  6│class LongNum:
  7│    def __init__(self, num):
  8│        """
  9│        Initialize the number.
 10│        """
...⋮...
 19│    def __str__(self):
 20│        """
 21│        Render the number as a string.
 22│        """
 23█        return str(num)
 24│
 25│
 26│@app.route('/subtract/<int:x>/<int:y>')
...⋮...
 38│@app.route('/divide/<int:x>/<int:y>')
 39│def divide(x, y):
 40│    if y == 0:
 41█        return f"Error: Cannot divide by zero"
 42│    else:
 43│        result = x / y
 44│        return str(result)
 45│
...⋮...
```
> Attempt to fix lint errors? yes > Attempt to fix lint errors? yes
</div> </div>