From 6acc3689e50172497a2ac20558926cde2719e35f Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Wed, 8 Nov 2023 11:11:47 -0800
Subject: [PATCH] copy

---
 docs/benchmarks-1106.md       |  6 ------
 docs/benchmarks-speed-1106.md | 15 ++++++---------
 2 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/docs/benchmarks-1106.md b/docs/benchmarks-1106.md
index efc7f3a89..78bed5760 100644
--- a/docs/benchmarks-1106.md
+++ b/docs/benchmarks-1106.md
@@ -71,9 +71,3 @@ The comments below only focus on comparing the `whole` edit format results:
 - The new `gpt-3.5-turbo-1106` model is completing the benchmark **3-4X faster** than the earlier GPT-3.5 models.
 - The success rate after the first try of 42% is comparable to the previous June (0613) model. The new November and previous June models are both worse than the original March (0301) model's 50% result on the first try.
 - The new model's 56% success rate after the second try seems comparable to the original March model, and somewhat better than the June model's 50% score.
-
-
-
-### Updates
-
-I will update the results on this page as quickly as my rate limit allows.
diff --git a/docs/benchmarks-speed-1106.md b/docs/benchmarks-speed-1106.md
index f14b5353d..2f7e85ab6 100644
--- a/docs/benchmarks-speed-1106.md
+++ b/docs/benchmarks-speed-1106.md
@@ -15,8 +15,8 @@ Aider relies on a
 to quantitatively evaluate performance.
 
-This is the latest in a series of benchmarking reports
-about the code
+This is the latest in a series of reports
+that use the aider benchmarking suite to assess and compare the code
 editing capabilities of OpenAI's GPT models.
 You can review previous reports to get more background on aider's benchmark suite:
 
@@ -44,13 +44,10 @@ Some observations:
 
 OpenAI is enforcing very low rate limits on the new GPT-4 model.
 The rate limiting disrupts the the benchmarking process,
-requiring it to be run single threaded, paused and restarted frequently.
+requiring it to run single threaded, pause and restart frequently.
 These anomolous conditions make it slow to
-benchmark the new model, and make comparisons against
-the older versions less reliable.
+benchmark the new model, and make
+it less reliable to compare the results with
+benchmark runs against the older model versions.
 Once the rate limits are relaxed I will do a clean run of the entire
 benchmark suite.
-
-### Updates
-
-I will update the results on this page as quickly as my rate limit allows.