2025-05-28 16:25:00 +00:00 · 2023-11-07 18:18:02 -08:00 · 2023-11-07 18:18:02 -08:00 · c86a957cf5
commit c86a957cf5
parent c55aff87e6
2 changed files with 62 additions and 58 deletions
--- a/docs/benchmarks-1106.md
+++ b/docs/benchmarks-1106.md
@@ -45,15 +45,19 @@ This is the edit format that aider uses by default with gpt-4.

 - The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models. I won't be able to properly quantify this until the rate limits loosen up.
 - **It seems better at producing correct code on the first try**. It gets
-~57% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
+~54% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
 - The new model seems to perform similar
-(~66%) to the old models (63-64%) after being given a second chance to correct bugs by reviewing test suite error output.
+(~63%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.

 **These are preliminary results.**
 OpenAI is enforcing very low
-rate limits on the new GPT-4 model. The limits are so low, that
-I have only been able to attempt
-113
+rate limits on the new GPT-4 model.
+The rate limiting is disrupting the normal flow of the benchmarking process,
+which needs to be restarted after pauses.
+The benchmarking tool is capable of such restarts, but
+I will trust a "clean" run much better once the rate limits are relaxed.
+The results currently reflect
+130
 out of the 133 Exercism problems.
 The problems are selected in random order, so results should be *roughly*
 indicative of the full benchmark.