mirror of https://github.com/Aider-AI/aider.git (synced 2025-06-16 09:34:59 +00:00)
copy
parent a57dd90a49
commit bd56adf16f
1 changed file with 7 additions and 7 deletions
@@ -50,19 +50,19 @@ After that aider runs as normal, with the following modifications:
 
 - Aider's suggestions were always accepted without user approval.
 - A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
-Plausibly correct means that aider concluded that it had successfully edited the repo
+Plausibly correct means that aider reported that it had successfully edited the repo
 without causing syntax errors or breaking any *pre-existing* tests.
 - If the solution isn't plausible, the harness launches aider to try again from scratch,
 alternating between using aider with GPT-4o and Opus.
 - If no plausible solution is found after six tries, the harness picks the solution
-with the least amount of edit/lint/test problems.
+with the fewest edit/lint/test problems.
 
 It's important to be clear that
 *aider and the benchmark harness
 only had access to the pre-existing tests in each problem's repo*.
-They could not see or run the held out "acceptance tests" that are used
-after benchmarking to see if the
-SWE Bench problem was correctly resolved.
+The held out "acceptance tests" were *only* used
+after benchmarking to compute statistics on which problems aider
+correctly resolved.
 
 The benchmarking process was similar to how a developer might use aider to
 resolve a GitHub issue:
@@ -312,10 +312,10 @@ The benchmark harness uses this status when deciding if aider
 has produced a plausible solution.
 
 To be clear, *aider cannot run or even see the held out "acceptance tests"* that
-are used to determine if a proposed solution correctly
+are used to judge if a proposed solution correctly
 resolves the problem.
 Those tests are only run outside of aider and the benchmark harness,
-to compute the final benchmark statistics.
+to compute the final benchmark statistics.
 
 ## Finding a plausible solution
 
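For readers of this diff, the retry behaviour described in the first hunk (up to six tries, alternating models, falling back to the attempt with the fewest problems) can be sketched roughly as below. This is a minimal illustration, not the actual harness code: `Attempt`, `find_solution`, the `run_aider` callback, and the model name strings are all hypothetical, and it assumes that "plausible" means zero edit, lint, and test problems and that GPT-4o goes first.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Sequence


@dataclass
class Attempt:
    """Outcome of one aider run (hypothetical structure, for illustration only)."""
    model: str
    edit_problems: int
    lint_problems: int
    test_problems: int

    @property
    def problem_count(self) -> int:
        return self.edit_problems + self.lint_problems + self.test_problems

    @property
    def plausible(self) -> bool:
        # Assumption: plausible == no edit errors, no syntax/lint errors,
        # and no broken pre-existing tests.
        return self.problem_count == 0


def find_solution(
    run_aider: Callable[[str], Attempt],
    models: Sequence[str] = ("gpt-4o", "opus"),  # placeholder names
    max_tries: int = 6,
) -> Attempt:
    """Retry from scratch, alternating models, until an attempt is plausible."""
    attempts = []
    for _, model in zip(range(max_tries), cycle(models)):
        attempt = run_aider(model)
        if attempt.plausible:
            return attempt
        attempts.append(attempt)
    # No plausible attempt: pick the one with the fewest problems
    # (ties go to the earliest attempt).
    return min(attempts, key=lambda a: a.problem_count)
```

Note that nothing in this loop consults the held out acceptance tests; only the per-attempt edit, lint, and pre-existing test outcomes feed the decision, which is the point the diff text emphasises.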