mirror of
https://github.com/Aider-AI/aider.git
synced 2025-05-28 00:05:01 +00:00
parent
2c83287c46
commit
59fed25cd1
4 changed files with 72 additions and 69 deletions
|
@ -45,22 +45,20 @@ This is the edit format that aider uses by default with gpt-4.
|
|||
|
||||
- The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models. I won't be able to properly quantify this until the rate limits loosen up.
|
||||
- **It seems better at producing correct code on the first try**. It gets
|
||||
~54% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
|
||||
53% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
|
||||
- The new model seems to perform similar
|
||||
(~63%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.
|
||||
(~62%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.
|
||||
|
||||
**These are preliminary results.**
|
||||
OpenAI is enforcing very low
|
||||
rate limits on the new GPT-4 model.
|
||||
The rate limiting is disrupting the normal flow of the benchmarking process,
|
||||
which needs to be restarted after pauses.
|
||||
The benchmarking tool is capable of such restarts, but
|
||||
I will trust a "clean" run much better once the rate limits are relaxed.
|
||||
The results currently reflect
|
||||
130
|
||||
out of the 133 Exercism problems.
|
||||
The problems are selected in random order, so results should be *roughly*
|
||||
indicative of the full benchmark.
|
||||
The rate limiting disrupts the benchmarking process,
|
||||
requiring it to be paused and restarted frequently.
|
||||
It took ~20 partial runs over ~2 days to complete all 133 Exercism problems.
|
||||
The benchmarking harness is designed to stop/restart in this manner,
|
||||
but results from a single "clean" run would be more trustworthy.
|
||||
Once the rate limits are relaxed I will do a clean
|
||||
run of the entire benchmark.
|
||||
|
||||
### gpt-3.5-turbo-1106
|
||||
|
||||
|
|
|
@ -69,7 +69,7 @@ More details on the benchmark, edit formats and results are discussed below.
|
|||
|
||||
## The benchmark
|
||||
|
||||
The benchmark uses
|
||||
The benchmark uses
|
||||
[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
|
||||
These
|
||||
exercises were designed to help individuals learn Python and hone
|
||||
|
@ -199,7 +199,7 @@ demo.py
|
|||
|
||||
### whole-func
|
||||
|
||||
The [whole-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/wholefile_func_coder.py)
|
||||
The [whole-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/wholefile_func_coder.py)
|
||||
format requests updated copies of whole files to be returned using the function call API.
|
||||
|
||||
|
||||
|
@ -218,7 +218,7 @@ format requests updated copies of whole files to be returned using the function
|
|||
|
||||
The
|
||||
[diff-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/editblock_func_coder.py)
|
||||
format requests a list of
|
||||
format requests a list of
|
||||
original/updated style edits to be returned using the function call API.
|
||||
|
||||
```
|
||||
|
@ -235,7 +235,7 @@ original/updated style edits to be returned using the function call API.
|
|||
],
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## GPT-3.5's performance
|
||||
|
@ -307,7 +307,7 @@ The benchmark harness also logs SHA hashes of
|
|||
all the OpenAI API requests and replies.
|
||||
This makes it possible to
|
||||
detect randomness or nondeterminism
|
||||
in the bechmarking process.
|
||||
in the benchmarking process.
|
||||
|
||||
It turns out that the OpenAI chat APIs are not deterministic, even at
|
||||
`temperature=0`. The same identical request will produce multiple
|
||||
|
|