Mirror of https://github.com/Aider-AI/aider.git, synced 2025-06-01 18:25:00 +00:00
Parent: c0f5cbb9b5
Commit: b56aa129dc
1 changed file with 20 additions and 35 deletions
|
@@ -96,7 +96,7 @@ result being reported here
|
||||||
that used aider with both GPT-4o & Opus.
|
that used aider with both GPT-4o & Opus.
|
||||||
|
|
||||||
As noted below, a single attempt using Aider with GPT-4o tied
|
As noted below, a single attempt using Aider with GPT-4o tied
|
||||||
the current top entry on the leader.
|
the current top entry on the leaderboard.
|
||||||
|
|
||||||
## Aider with GPT-4o & Opus
|
## Aider with GPT-4o & Opus
|
||||||
|
|
||||||
|
@@ -105,17 +105,19 @@ The harness proceeded in a fixed order, always starting with GPT-4o and
|
||||||
then alternating with Opus until a plausible solution was found for each
|
then alternating with Opus until a plausible solution was found for each
|
||||||
problem.
|
problem.
|
||||||
|
|
||||||
The table below breaks down the 79 solutions that were ultimately
|
The table below breaks down the plausible solutions that
|
||||||
|
were found for the 300 problems.
|
||||||
|
It also provides details on the 79 that were ultimately
|
||||||
verified as correctly resolving their issue.
|
verified as correctly resolving their issue.
|
||||||
Some noteworthy observations:
|
Some noteworthy observations:
|
||||||
|
|
||||||
- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
|
- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
|
||||||
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
|
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
|
||||||
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
|
These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
|
||||||
- A long tail of solutions continued to be found by both models including one correctly resolved solution on the final, sixth attempt of that problem.
|
- A long tail of solutions continued to be found using both models including one correctly resolved solution on the final, sixth attempt of that problem.
|
||||||
|
|
||||||
|
|
||||||
| Attempt | Agent |Number<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number<br/>correctly<br>resolved | Percent of<br>correctly<br>resolved | Score on<br>SWE Bench<br>Lite |
|
| Attempt | Agent |Number of<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number of<br/>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite<br>(resolved/300) |
|
||||||
|:--------:|------------|---------:|---------:|----:|---:|--:|
|
|:--------:|------------|---------:|---------:|----:|---:|--:|
|
||||||
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
|
||||||
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
|
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
|
||||||
|
@@ -138,10 +140,10 @@ with a greater chance of going on to be accepted as resolving the issue.
|
||||||
Again, this is biased by the turn ordering.
|
Again, this is biased by the turn ordering.
|
||||||
But other anecdotal evidence from earlier runs of the benchmark
|
But other anecdotal evidence from earlier runs of the benchmark
|
||||||
also supports the observation that aider with GPT-4o is significantly stronger than Opus
|
also supports the observation that aider with GPT-4o is significantly stronger than Opus
|
||||||
for this endeavor.
|
for this benchmark.
|
||||||
|
|
||||||
|
|
||||||
| Agent | Number<br>plausible<br>solutions | Number<br>correctly<br>resolved | Percent<br>plausible<br>which<br>resolved|
|
| Agent | Number of<br>plausible<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>plausible<br>which<br>correctly<br>resolved<br>|
|
||||||
|------------|---------:|---------:|---:|
|
|------------|---------:|---------:|---:|
|
||||||
| Aider with GPT-4o | 239 | 66 |27.6% |
|
| Aider with GPT-4o | 239 | 66 |27.6% |
|
||||||
| Aider with Opus | 61 | 13 |21.3% |
|
| Aider with Opus | 61 | 13 |21.3% |
|
||||||
|
@@ -194,7 +196,7 @@ Aider successfully identified the correct file to edit
|
||||||
in 70.3% of the benchmark tasks.
|
in 70.3% of the benchmark tasks.
|
||||||
|
|
||||||
We can determine which file needed to be edited using the "gold" patch
|
We can determine which file needed to be edited using the "gold" patch
|
||||||
which is associated with each SWE Bench Task.
|
which is associated with each SWE Bench task.
|
||||||
This patch was created by a human developer
|
This patch was created by a human developer
|
||||||
to solve the issue, and therefore reveals a file which can
|
to solve the issue, and therefore reveals a file which can
|
||||||
be edited to solve the problem.
|
be edited to solve the problem.
|
||||||
|
@@ -251,33 +253,18 @@ make the correct changes to resolve it.
|
||||||
|
|
||||||
```
|
```
|
||||||
app.py:23:36: F821 undefined name 'num'
|
app.py:23:36: F821 undefined name 'num'
|
||||||
app.py:41:16: F541 f-string is missing placeholders
|
|
||||||
|
|
||||||
app.py:
|
app.py:
|
||||||
...⋮...
|
...⋮...
|
||||||
6│class LongNum:
|
6│class LongNum:
|
||||||
7│ def __init__(self, num):
|
|
||||||
8│ """
|
|
||||||
9│ Initialize the number.
|
|
||||||
10│ """
|
|
||||||
...⋮...
|
...⋮...
|
||||||
19│ def __str__(self):
|
19│ def expound(self, threshold):
|
||||||
20│ """
|
20│ number = self.basis
|
||||||
21│ Render the number as a string.
|
21│ while number < threshold:
|
||||||
22│ """
|
22│ number *= self.factor
|
||||||
23█ return str(num)
|
23█ return num
|
||||||
24│
|
24│
|
||||||
25│
|
25│
|
||||||
26│@app.route('/subtract/<int:x>/<int:y>')
|
|
||||||
...⋮...
|
|
||||||
38│@app.route('/divide/<int:x>/<int:y>')
|
|
||||||
39│def divide(x, y):
|
|
||||||
40│ if y == 0:
|
|
||||||
41█ return f"Error: Cannot divide by zero"
|
|
||||||
42│ else:
|
|
||||||
43│ result = x / y
|
|
||||||
44│ return str(result)
|
|
||||||
45│
|
|
||||||
...⋮...
|
...⋮...
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@@ -288,7 +275,7 @@ app.py:
|
||||||
In the benchmark, these linting suggestions are always accepted.
|
In the benchmark, these linting suggestions are always accepted.
|
||||||
At completion,
|
At completion,
|
||||||
aider reports a linting outcome that
|
aider reports a linting outcome that
|
||||||
indicates if it was able to ultimately produce
|
indicates if it was able to produce
|
||||||
code without any outstanding linting errors.
|
code without any outstanding linting errors.
|
||||||
The benchmark harness used this status as
|
The benchmark harness used this status as
|
||||||
one of the criteria to determine if aider has
|
one of the criteria to determine if aider has
|
||||||
|
@@ -298,8 +285,8 @@ created a plausible solution.
|
||||||
|
|
||||||
The final criterion for a plausible solution is that
|
The final criterion for a plausible solution is that
|
||||||
all tests must be passing.
|
all tests must be passing.
|
||||||
Aider can be configured with the command needed to run tests for a repo,
|
Aider can be configured with the command to run tests for a repo,
|
||||||
and will automatically attempt to fix any testing errors.
|
and will automatically attempt to fix any test failures.
|
||||||
|
|
||||||
A user working on a python project might configure testing
|
A user working on a python project might configure testing
|
||||||
by launching aider like this:
|
by launching aider like this:
|
||||||
|
@@ -318,11 +305,11 @@ pre-existing tests or if any new
|
||||||
tests that it created aren't passing.
|
tests that it created aren't passing.
|
||||||
|
|
||||||
As with editing and linting, aider reports a testing outcome
|
As with editing and linting, aider reports a testing outcome
|
||||||
that indicates if it completed with any outstanding testing errors.
|
that indicates if it completed with any outstanding failing tests.
|
||||||
The benchmark harness uses this status when deciding if aider
|
The benchmark harness uses this status when deciding if aider
|
||||||
has produced a plausible solution.
|
has produced a plausible solution.
|
||||||
|
|
||||||
To be clear, *aider cannot run or even see the "acceptance tests"*
|
To be clear, *aider cannot run or even see the held out "acceptance tests"*
|
||||||
that are used to determine if a proposed solution correctly
|
that are used to determine if a proposed solution correctly
|
||||||
resolves the problem.
|
resolves the problem.
|
||||||
Those tests are only run outside of aider and the benchmark harness,
|
Those tests are only run outside of aider and the benchmark harness,
|
||||||
|
@@ -390,9 +377,7 @@ with results from testing
|
||||||
the "gold" patch that was developed by a human to correctly solve the issue.
|
the "gold" patch that was developed by a human to correctly solve the issue.
|
||||||
If they match, the candidate solution has correctly resolved the issue.
|
If they match, the candidate solution has correctly resolved the issue.
|
||||||
|
|
||||||
|
These acceptance tests are only ever run outside of aider
|
||||||
|
|
||||||
These so called `test_patch` acceptance tests are only ever run outside of aider
|
|
||||||
and the benchmark harness, and only to compute the number of
|
and the benchmark harness, and only to compute the number of
|
||||||
correctly resolved instances.
|
correctly resolved instances.
|
||||||
They are never run, used, or even visible during aider's attempts to solve the problems.
|
They are never run, used, or even visible during aider's attempts to solve the problems.
|
||||||
|
|
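As a sanity check on the two tables touched by this diff, the percentage columns can be recomputed from the raw counts. The sketch below is illustrative only: the helper `pct` and the constant names are hypothetical and are not part of aider or the benchmark harness. The counts are copied from the table rows visible in the diff, and 300 and 79 are the problem and resolved totals stated in the changed text.

```python
# Illustrative only: recompute the table percentages from counts shown in the diff.
TOTAL_PROBLEMS = 300  # SWE Bench Lite problems, per the changed text
TOTAL_RESOLVED = 79   # correctly resolved solutions, per the changed text


def pct(part: int, whole: int) -> str:
    """Format part/whole as a percentage with one decimal place."""
    return f"{100 * part / whole:.1f}%"


# Per-attempt rows visible in the diff: (attempt, agent, plausible, resolved).
attempts = [
    (1, "Aider with GPT-4o", 208, 61),
    (2, "Aider with Opus", 49, 10),
]

# Columns: percent of plausible, percent of resolved, score (resolved/300).
attempt_rows = [
    (
        attempt,
        agent,
        pct(plausible, TOTAL_PROBLEMS),
        pct(resolved, TOTAL_RESOLVED),
        pct(resolved, TOTAL_PROBLEMS),
    )
    for attempt, agent, plausible, resolved in attempts
]

# Per-agent totals from the second table: (agent, plausible, resolved).
agents = [
    ("Aider with GPT-4o", 239, 66),
    ("Aider with Opus", 61, 13),
]

# Column: percent of plausible solutions which were correctly resolved.
agent_rows = [(agent, pct(resolved, plausible)) for agent, plausible, resolved in agents]

for row in attempt_rows + agent_rows:
    print(row)
```

The new header text "(resolved/300)" in the first table corresponds to the last column computed in `attempt_rows`, which divides by the full problem count rather than by the 79 resolved solutions.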