Paul Gauthier 2023-11-08 08:36:48 -08:00
parent 2c83287c46
commit 59fed25cd1
4 changed files with 72 additions and 69 deletions


@@ -45,22 +45,20 @@ This is the edit format that aider uses by default with gpt-4.
- The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models. I won't be able to properly quantify this until the rate limits loosen up.
- **It seems better at producing correct code on the first try**. It gets
53% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
- The new model seems to perform similarly
(~62%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.
**These are preliminary results.**
OpenAI is enforcing very low
rate limits on the new GPT-4 model.
The rate limiting disrupts the benchmarking process,
requiring it to be paused and restarted frequently.
It took ~20 partial runs over ~2 days to complete all 133 Exercism problems.
The benchmarking harness is designed to stop/restart in this manner,
but results from a single "clean" run would be more trustworthy.
Once the rate limits are relaxed I will do a clean
run of the entire benchmark.
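As a rough sketch of that stop/restart pattern (an illustration, not the actual aider benchmark harness), each exercise can record its result on disk and be skipped on later partial runs:

```python
import json
from pathlib import Path


def run_benchmark(exercises: list[str], results_dir: Path) -> None:
    """Run each exercise once, skipping any finished in an earlier partial run."""
    for name in exercises:
        marker = results_dir / name / "results.json"
        if marker.exists():
            continue  # completed before a rate-limit pause
        passed = run_exercise(name)  # hypothetical stand-in for driving the model
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.write_text(json.dumps({"exercise": name, "passed": passed}))


def run_exercise(name: str) -> bool:
    """Stand-in for actually running the model against one exercise."""
    raise NotImplementedError
```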
### gpt-3.5-turbo-1106


@@ -69,7 +69,7 @@ More details on the benchmark, edit formats and results are discussed below.
## The benchmark
The benchmark uses
[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
These
exercises were designed to help individuals learn Python and hone
@@ -199,7 +199,7 @@ demo.py
### whole-func
The [whole-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/wholefile_func_coder.py)
format requests updated copies of whole files to be returned using the function call API.
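To make that concrete, here is a hedged sketch of the kind of function definition such a format could pass to the OpenAI function call API. The function and field names below (`write_file`, `files`, `path`, `content`) are illustrative assumptions, not necessarily aider's exact schema.

```python
# Sketch of a "return whole updated files" function schema, expressed as a
# Python dict suitable for the `functions` parameter of the chat completions API.
# The names are illustrative assumptions, not aider's actual schema.
write_file_function = {
    "name": "write_file",
    "description": "Return the complete, updated contents of the edited files.",
    "parameters": {
        "type": "object",
        "properties": {
            "explanation": {
                "type": "string",
                "description": "Brief reasoning for the changes.",
            },
            "files": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"},
                        "content": {
                            "type": "string",
                            "description": "The full, updated file contents.",
                        },
                    },
                    "required": ["path", "content"],
                },
            },
        },
        "required": ["files"],
    },
}
```

Constraining the reply to structured arguments like this is what makes the format easy to parse, at the cost of asking the model to re-emit entire files.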
@@ -218,7 +218,7 @@ format requests updated copies of whole files to be returned using the function
The
[diff-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/editblock_func_coder.py)
format requests a list of
original/updated style edits to be returned using the function call API.
```
@@ -235,7 +235,7 @@ original/updated style edits to be returned using the function call API.
],
}
]
}
```
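Only the tail of that schema survives in the hunk above, so as a rough illustration here is a hedged sketch of how the function-call arguments for an original/updated edit format could be parsed and applied. The argument names (`edits`, `original_lines`, `updated_lines`) are assumptions for illustration, not aider's exact field names.

```python
import json


def apply_func_edits(function_call_arguments: str, files: dict[str, str]) -> dict[str, str]:
    """Apply original/updated style edits returned via the function call API.

    `files` maps path -> current file contents. Field names here are
    illustrative assumptions, not aider's exact schema.
    """
    args = json.loads(function_call_arguments)
    for edit in args.get("edits", []):
        path = edit["path"]
        original = edit["original_lines"]
        updated = edit["updated_lines"]
        if original not in files.get(path, ""):
            raise ValueError(f"original lines not found in {path}")
        # Swap only the first occurrence, mirroring a search/replace style edit.
        files[path] = files[path].replace(original, updated, 1)
    return files
```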
## GPT-3.5's performance
@@ -307,7 +307,7 @@ The benchmark harness also logs SHA hashes of
all the OpenAI API requests and replies.
This makes it possible to
detect randomness or nondeterminism
in the benchmarking process.
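As a rough illustration of that kind of logging (a minimal sketch, not the actual harness code), each request/response pair can be reduced to a pair of hashes that is cheap to compare across runs:

```python
import hashlib
import json


def log_request_hashes(request: dict, response: dict, logfile: str = "api_hashes.log") -> None:
    """Append SHA-1 hashes of an API request and its reply to a log file."""
    # Serializing with sorted keys means logically identical payloads
    # always produce the same hash.
    req_hash = hashlib.sha1(json.dumps(request, sort_keys=True).encode()).hexdigest()
    resp_hash = hashlib.sha1(json.dumps(response, sort_keys=True).encode()).hexdigest()
    with open(logfile, "a") as f:
        f.write(f"{req_hash} {resp_hash}\n")
```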
It turns out that the OpenAI chat APIs are not deterministic, even at
`temperature=0`. The same identical request will produce multiple