Paul Gauthier 2024-05-23 10:27:49 -07:00
parent c0f5cbb9b5
commit b56aa129dc

@@ -96,7 +96,7 @@ result being reported here
that used aider with both GPT-4o & Opus.
As noted below, a single attempt using Aider with GPT-4o tied
the current top entry on the leaderboard.

## Aider with GPT-4o & Opus
@@ -105,17 +105,19 @@ The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found for each
problem.
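In rough pseudocode, that fixed-order retry loop might look like the sketch below; the `run_aider` and `is_plausible` callables and their shapes are illustrative assumptions, not aider's actual internals:

```python
# a minimal sketch of the fixed-order retry harness (helper callables are assumed)
MODELS = ["gpt-4o", "opus"]

def solve(problem, run_aider, is_plausible, max_attempts=6):
    """Try GPT-4o first, then alternate with Opus, until a plausible solution."""
    for attempt in range(max_attempts):
        model = MODELS[attempt % 2]           # fixed order: GPT-4o, Opus, GPT-4o, ...
        solution = run_aider(problem, model)  # one full aider attempt with this model
        if is_plausible(solution):            # edited the repo, lint clean, tests pass
            return solution
    return None                               # no plausible solution after all attempts
```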
The table below breaks down the plausible solutions that
were found for the 300 problems.
It also provides details on the 79 that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
  These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found using both models, including one correctly resolved solution on the final, sixth attempt of that problem.
| Attempt | Agent | Number&nbsp;of<br>plausible<br>solutions | Percent&nbsp;of<br>plausible<br>solutions | Number&nbsp;of<br>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench<br>Lite<br>(resolved/300) |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
@@ -138,10 +140,10 @@ with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.
| Agent | Number&nbsp;of<br>plausible<br>solutions | Number&nbsp;of<br>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>plausible<br>which<br>correctly<br>resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 239 | 66 | 27.6% |
| Aider with Opus | 61 | 13 | 21.3% |
@@ -194,7 +196,7 @@ Aider successfully identified the correct file to edit
in 70.3% of the benchmark tasks.

We can determine which file needed to be edited using the "gold" patch
which is associated with each SWE Bench task.
This patch was created by a human developer
to solve the issue, and therefore reveals a file which can
be edited to solve the problem.
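As a concrete sketch of how that check can be made, the snippet below compares the files named in the gold patch against the files aider edited; the `unidiff` dependency and the input shapes are assumptions for illustration, not the benchmark's actual tooling:

```python
# a minimal sketch of the "did aider edit a gold file?" check
# (unidiff and the input shapes are illustrative assumptions)
from unidiff import PatchSet

def edited_gold_file(gold_patch: str, files_edited_by_aider: set[str]) -> bool:
    # the gold patch names the file(s) a human changed to fix the issue
    gold_files = {patched_file.path for patched_file in PatchSet(gold_patch)}
    return bool(gold_files & files_edited_by_aider)
```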
@@ -251,33 +253,18 @@ make the correct changes to resolve it.
```
app.py:23:36: F821 undefined name 'num'

app.py:
...⋮...
6│class LongNum:
...⋮...
19│    def expound(self, threshold):
20│        number = self.basis
21│        while number < threshold:
22│            number *= self.factor
23█        return num
24│
25│
...⋮...
```
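For intuition, a lint gate of this sort can be approximated with flake8; the specific `--select` list and the surrounding wiring are assumptions for illustration, not aider's exact linter configuration:

```python
# a sketch of a lint gate: run flake8 on the edited file and report
# whether any errors remain (the --select list is an illustrative assumption)
import subprocess

def lint_clean(path: str) -> bool:
    result = subprocess.run(
        ["flake8", "--select=E9,F821,F822,F831", path],
        capture_output=True,
        text=True,
    )
    # flake8 exits nonzero when it finds violations
    return result.returncode == 0
```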
@@ -288,7 +275,7 @@ app.py:
In the benchmark, these linting suggestions are always accepted.
At completion,
aider reports a linting outcome that
indicates if it was able to produce
code without any outstanding linting errors.
The benchmark harness uses this status as
one of the criteria to determine if aider has
@@ -298,8 +285,8 @@ created a plausible solution.
The final criterion for a plausible solution is that
all tests must be passing.
Aider can be configured with the command to run tests for a repo,
and will automatically attempt to fix any test failures.

A user working on a Python project might configure testing
by launching aider like this:
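```
aider --test-cmd pytest
```

Here `pytest` is just a stand-in for whatever command runs this repo's test suite; `--test-cmd` is the aider option for supplying it.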
@@ -318,11 +305,11 @@ pre-existing tests or if any new
tests that it created aren't passing.
As with editing and linting, aider reports a testing outcome
that indicates if it completed with any outstanding failing tests.
The benchmark harness uses this status when deciding if aider
has produced a plausible solution.

To be clear, *aider cannot run or even see the held-out "acceptance tests"*
that are used to determine if a proposed solution correctly
resolves the problem.
Those tests are only run outside of aider and the benchmark harness,
@@ -390,9 +377,7 @@ with results from testing
the "gold" patch that was developed by a human to correctly solve the issue. the "gold" patch that was developed by a human to correctly solve the issue.
If they match, the candidate solution has correctly resolved the issue. If they match, the candidate solution has correctly resolved the issue.
These acceptance tests are only ever run outside of aider
These so called `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of and the benchmark harness, and only to compute the number of
correctly resolved instances. correctly resolved instances.
They are never run, used, or even visible during aider's attempts to solve the problems. They are never run, used, or even visible during aider's attempts to solve the problems.
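Concretely, that final comparison step could look like the sketch below; the outcome-dictionary shape and helper name are assumptions for illustration, not the benchmark's actual data model:

```python
# a sketch of the resolution check: the candidate resolves the issue only
# if the acceptance tests produce the same outcomes as the "gold" patch
# (the outcome-dict shape is an illustrative assumption)

def correctly_resolved(candidate_outcomes: dict[str, str],
                       gold_outcomes: dict[str, str]) -> bool:
    # both map acceptance test names to results, e.g. {"test_divide": "PASSED"}
    return candidate_outcomes == gold_outcomes
```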