From c55aff87e684c94df08c0479ff028969b2b4efb6 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Tue, 7 Nov 2023 14:25:46 -0800
Subject: [PATCH] copy

---
 docs/benchmarks-1106.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/benchmarks-1106.md b/docs/benchmarks-1106.md
index 7667c2a96..1906781a0 100644
--- a/docs/benchmarks-1106.md
+++ b/docs/benchmarks-1106.md
@@ -44,15 +44,17 @@
 For now, I have only benchmarked the GPT-4 models using the `diff` edit method.
 This is the edit format that aider uses by default with gpt-4.
 
 - The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models. I won't be able to properly quantify this until the rate limits loosen up.
-- **It seems better at producing correct code on the first try**. It gets ~56% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
-- The new model seems to perform similar (~66%) to the old models (63-64%) after being given a second chance to correct bugs by reviewing test suite error output.
+- **It seems better at producing correct code on the first try**. It gets
+~57% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
+- The new model seems to perform similar
+(~66%) to the old models (63-64%) after being given a second chance to correct bugs by reviewing test suite error output.
 
 **These are preliminary results.**
 OpenAI is enforcing very low
 rate limits on the new GPT-4 model.
 The limits are so low, that I have only been able to attempt
-110
-out of 133 exercism problems.
+113
+out of the 133 Exercism problems.
 The problems are selected in random order,
 so results should be *roughly* indicative of the full benchmark.
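The pass rates quoted in this patch come from a two-pass protocol: the model first attempts each Exercism exercise cold, and if the test suite fails it gets one retry that includes the failing test output. Below is a minimal sketch of that flow, assuming hypothetical `solve` and `run_tests` callables supplied by the harness; this is an illustration of the protocol described in the text, not aider's actual benchmark code.

```python
# Illustrative two-pass benchmark loop (hypothetical harness, not aider's
# actual code): attempt each exercise, run its test suite, and on failure
# give the model one retry that can see the test errors.
import random

def run_benchmark(exercises, solve, run_tests, attempts=2):
    """Return (first_try_passes, final_passes) over the given exercises.

    `solve(exercise, errors)` asks the model to edit the solution files
    (`errors` is None on the first try); `run_tests(exercise)` returns
    an (ok, errors) pair. Both are assumed to be provided by the caller.
    """
    first_try = 0
    final = 0
    # Attempt exercises in random order, as the patch describes, so a
    # partial run is roughly indicative of the full benchmark.
    random.shuffle(exercises)
    for exercise in exercises:
        errors = None
        for attempt in range(attempts):
            solve(exercise, errors)           # model edits the solution
            ok, errors = run_tests(exercise)  # run the exercise's tests
            if ok:
                if attempt == 0:
                    first_try += 1
                final += 1
                break
    return first_try, final
```

Under this flow, the first return value corresponds to the first-try rate (~57% for `gpt-4-1106-preview`) and the second to the rate after reviewing test suite error output (~66%) cited in the patch.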