Paul Gauthier 2023-11-08 11:11:47 -08:00
parent cb63b61411
commit 6acc3689e5
2 changed files with 6 additions and 15 deletions


@@ -71,9 +71,3 @@ The comments below only focus on comparing the `whole` edit format results:
 - The new `gpt-3.5-turbo-1106` model is completing the benchmark **3-4X faster** than the earlier GPT-3.5 models.
 - The success rate after the first try of 42% is comparable to the previous June (0613) model. The new November and previous June models are both worse than the original March (0301) model's 50% result on the first try.
 - The new model's 56% success rate after the second try seems comparable to the original March model, and somewhat better than the June model's 50% score.
-### Updates
-I will update the results on this page as quickly as my rate limit allows.


@@ -15,8 +15,8 @@ Aider relies on a
 to quantitatively evaluate
 performance.
-This is the latest in a series of benchmarking reports
-about the code
+This is the latest in a series of reports
+that use the aider benchmarking suite to assess and compare the code
 editing capabilities of OpenAI's GPT models. You can review previous
 reports to get more background on aider's benchmark suite:
@@ -44,13 +44,10 @@ Some observations:
 OpenAI is enforcing very low
 rate limits on the new GPT-4 model.
 The rate limiting disrupts the the benchmarking process,
-requiring it to be run single threaded, paused and restarted frequently.
+requiring it to run single threaded, pause and restart frequently.
 These anomolous conditions make it slow to
-benchmark the new model, and make comparisons against
-the older versions less reliable.
+benchmark the new model, and make
+it less reliable to compare the results with
+benchmark runs against the older model versions.
 Once the rate limits are relaxed I will do a clean
 run of the entire benchmark suite.
-### Updates
-I will update the results on this page as quickly as my rate limit allows.