This commit is contained in:
Paul Gauthier 2023-11-08 11:11:47 -08:00
parent cb63b61411
commit 6acc3689e5
2 changed files with 6 additions and 15 deletions


@@ -71,9 +71,3 @@ The comments below only focus on comparing the `whole` edit format results:
- The new `gpt-3.5-turbo-1106` model is completing the benchmark **3-4X faster** than the earlier GPT-3.5 models.
- The 42% success rate on the first try is comparable to the previous June (0613) model. The new November and previous June models are both worse than the original March (0301) model's 50% result on the first try.
- The new model's 56% success rate after the second try seems comparable to the original March model, and somewhat better than the June model's 50% score.
### Updates
I will update the results on this page as quickly as my rate limit allows.


@@ -15,8 +15,8 @@ Aider relies on a
to quantitatively evaluate
performance.
This is the latest in a series of benchmarking reports
about the code
This is the latest in a series of reports
that use the aider benchmarking suite to assess and compare the code
editing capabilities of OpenAI's GPT models. You can review previous
reports to get more background on aider's benchmark suite:
@@ -44,13 +44,10 @@ Some observations:
OpenAI is enforcing very low
rate limits on the new GPT-4 model.
The rate limiting disrupts the benchmarking process,
requiring it to be run single threaded, paused and restarted frequently.
requiring it to run single threaded, pause and restart frequently.
These anomalous conditions make it slow to
benchmark the new model, and make comparisons against
the older versions less reliable.
benchmark the new model, and make
it less reliable to compare the results with
benchmark runs against the older model versions.
Once the rate limits are relaxed I will do a clean
run of the entire benchmark suite.
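The pause-and-restart loop described above can be sketched as a retry-with-backoff wrapper around each API call. This is a minimal illustration, not aider's actual benchmark code; `call_with_backoff` and the use of `RuntimeError` as a stand-in for a rate-limit error are assumptions for the example:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=6, base_delay=1.0):
    """Retry request_fn with exponential backoff plus jitter,
    as a single-threaded benchmark run might do when it hits
    strict API rate limits."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RuntimeError:  # stand-in for a rate-limit error
            # Wait 2^attempt * base_delay seconds, plus random jitter,
            # before retrying the request.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("rate limit: retries exhausted")
```

Running the suite single threaded with a wrapper like this trades throughput for reliability, which is why a clean re-run is planned once the limits are relaxed.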
### Updates
I will update the results on this page as quickly as my rate limit allows.