From be0296318f247324857f73147d99268e5d00fd06 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Wed, 8 Nov 2023 10:53:03 -0800
Subject: [PATCH] speed results

---
 assets/benchmarks-speed-1106.svg | 1781 ++++++++++++++++++++++++++++++
 docs/benchmarks-1106.md          |    2 +-
 docs/benchmarks-speed-1106.md    |   58 +
 3 files changed, 1840 insertions(+), 1 deletion(-)
 create mode 100644 assets/benchmarks-speed-1106.svg
 create mode 100644 docs/benchmarks-speed-1106.md

diff --git a/assets/benchmarks-speed-1106.svg b/assets/benchmarks-speed-1106.svg
new file mode 100644
index 000000000..dc1ba67ea
--- /dev/null
+++ b/assets/benchmarks-speed-1106.svg
@@ -0,0 +1,1781 @@
[1781 lines of SVG markup elided: benchmark chart generated by Matplotlib v3.8.1 (https://matplotlib.org/) on 2023-11-08T10:52:40]

diff --git a/docs/benchmarks-1106.md b/docs/benchmarks-1106.md
index 930de11d6..efc7f3a89 100644
--- a/docs/benchmarks-1106.md
+++ b/docs/benchmarks-1106.md
@@ -1,4 +1,4 @@
-# Code editing benchmarks for OpenAI's "1106" models
+# Code editing skill benchmarks for OpenAI's "1106" models
 
 [![benchmark results](../assets/benchmarks-1106.svg)](https://aider.chat/assets/benchmarks-1106.svg)
 
diff --git a/docs/benchmarks-speed-1106.md b/docs/benchmarks-speed-1106.md
new file mode 100644
index 000000000..cc974f086
--- /dev/null
+++ b/docs/benchmarks-speed-1106.md
@@ -0,0 +1,58 @@
+# Code editing speed benchmarks for OpenAI's "1106" models
+
+[![benchmark results](../assets/benchmarks-speed-1106.svg)](https://aider.chat/assets/benchmarks-speed-1106.svg)
+
+[OpenAI just released new versions of GPT-3.5 and GPT-4](https://openai.com/blog/new-models-and-developer-products-announced-at-devday),
+and there's a lot
+of interest in their capabilities and performance.
+With that in mind, I've been benchmarking the new models.
+
+[Aider](https://github.com/paul-gauthier/aider)
+is an open source command line chat tool that lets you work with GPT to edit
+code in your local git repo.
+Aider relies on a
+[code editing benchmark](https://aider.chat/docs/benchmarks.html)
+to quantitatively evaluate
+performance.
+
+This is the latest in a series of benchmarking reports
+about the code
+editing capabilities of OpenAI's GPT models. You can review previous
+reports to get more background on aider's benchmark suite:
+
+- [GPT code editing benchmarks](https://aider.chat/docs/benchmarks.html) evaluates the March and June versions of GPT-3.5 and GPT-4.
+- [Code editing skill benchmarks for OpenAI's "1106" models](https://aider.chat/docs/benchmarks-1106.html) compares the older models to the November (1106) models.
+
+## Speed
+
+This report compares the **speed** of the various GPT models.
+Aider's benchmark measures the response time of the OpenAI chat completion
+endpoint each time it asks GPT to solve a programming exercise in the benchmark
+suite. These results measure only the time spent waiting for OpenAI to
+respond to the prompt.
+So they measure
+how fast these models can
+generate responses that primarily consist of source code.
+A minimal sketch of this timing measurement appears at the end of this post.
+
+Some observations:
+
+- **GPT-3.5 got 6-11x faster.** The `gpt-3.5-turbo-1106` model is 6-11x faster than the June (0613) version, which has been the default `gpt-3.5-turbo` model.
+- **GPT-4 Turbo is 4-5x faster.** The new `gpt-4-1106-preview` model is 4-5x faster than the June (0613) version, which has been the default `gpt-4` model.
+- The old March (0301) version of GPT-3.5 is actually faster than the June (0613) version. This was a surprising discovery.
+
+### Preliminary results
+
+**These are preliminary results.**
+OpenAI is enforcing very low
+rate limits on the new GPT-4 model.
+The rate limiting disrupts the benchmarking process,
+requiring it to be run single threaded and to be
+paused and restarted frequently.
+These anomalous conditions make it slow to
+benchmark the new model, and they make comparisons against
+the older versions less reliable.
+Once the rate limits are relaxed, I will do a clean
+run of the entire benchmark suite.
+
+### Updates
+
+I will update the results on this page as quickly as my rate limit allows.
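+
+### Timing sketch
+
+For readers who want to see the shape of this measurement, here is a minimal
+sketch of timing a single chat completion request.
+This is not aider's actual benchmark harness; it assumes the v1 `openai`
+Python client, and the prompt is only illustrative.
+
+```python
+import time
+
+from openai import OpenAI  # assumes the v1 `openai` Python package
+
+client = OpenAI()  # reads OPENAI_API_KEY from the environment
+
+def timed_completion(model, messages):
+    """Send one non-streaming chat completion request and time the round trip."""
+    start = time.monotonic()
+    response = client.chat.completions.create(model=model, messages=messages)
+    elapsed = time.monotonic() - start
+    return response, elapsed
+
+# Illustrative prompt only; the real benchmark sends aider's full editing prompts.
+messages = [
+    {"role": "user", "content": "Write a python function that reverses a string."},
+]
+
+for model in ["gpt-3.5-turbo-1106", "gpt-4-1106-preview"]:
+    _, elapsed = timed_completion(model, messages)
+    print(f"{model}: {elapsed:.1f}s")
+```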