diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py
index 237b1ca52..478b1034a 100755
--- a/benchmark/benchmark.py
+++ b/benchmark/benchmark.py
@@ -196,7 +196,7 @@ def show_stats(dirnames):
         arrowprops={"arrowstyle": "->", "connectionstyle": "arc3,rad=0.3"},
     )
     ax.annotate(
-        "Second attempt,\nincluding\nunit test errors",
+        "Second attempt,\nincluding unit\ntest error output",
         xy=(2.55, 56),
         xytext=(3.5, top),
         horizontalalignment="center",
diff --git a/docs/benchmarks-1106.md b/docs/benchmarks-1106.md
index ecf00ee22..2e15972de 100644
--- a/docs/benchmarks-1106.md
+++ b/docs/benchmarks-1106.md
@@ -35,13 +35,15 @@ With that in mind, I've been benchmarking the new models.
 
 ## gpt-4-1106-preview
 
 - The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models! I won't be able to properly quantify this until the rate limits loosen up. Currently I am seeing 10X faster responses.
-- **It is better at producing correct code on the first try**. It gets ~59% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
-- The new model seems to perform similarly to the old models after being given a chance to correct bugs by reviewing test suite error output.
+- **It is better at producing correct code on the first try**. It gets ~60% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
+- The new model seems to perform somewhat better (69%) than the old models (63-64%) after being given a chance to correct bugs by reviewing test suite error output.
 
 **These results are preliminiary.** OpenAI is enforcing very low
 rate limits on the new GPT-4 model. The limits are so low, that
-I have only been able to attempt 56 out of 133 exercism problems.
+I have only been able to attempt
+58
+out of 133 exercism problems.
 They are randomly chosen, so results should be *roughly* indicative of the full benchmark.
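
For context on the first hunk, here is a minimal, self-contained sketch (not part of the diff) of how a matplotlib callout like the relabeled one is drawn: `ax.annotate()` places the multi-line text at `xytext` and draws a curved arrow back to the point given by `xy`. The bar data, axis limits, and `top` value below are placeholders; only the annotation text and arrow styling mirror the change above, and the real `show_stats()` derives its coordinates from the benchmark results.

```python
# Hypothetical, simplified recreation of the annotation touched in the first
# hunk. All data and coordinates here are placeholders.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar([1, 2, 3], [48, 56, 69])  # stand-in pass-rate bars
ax.set_xlim(0.5, 4.5)
ax.set_ylim(0, 90)
top = 80  # y-position of the label text (placeholder)

ax.annotate(
    "Second attempt,\nincluding unit\ntest error output",
    xy=(2.55, 56),      # arrow points at the second-attempt result
    xytext=(3.5, top),  # label text sits up and to the right
    horizontalalignment="center",
    arrowprops={"arrowstyle": "->", "connectionstyle": "arc3,rad=0.3"},
)
plt.savefig("annotation_sketch.png")
```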