diff --git a/assets/benchmarks-1106.svg b/assets/benchmarks-1106.svg
new file mode 100644
index 000000000..3b8853edd
--- /dev/null
+++ b/assets/benchmarks-1106.svg
@@ -0,0 +1,1868 @@
+[1,868-line SVG omitted: benchmark results chart, generated 2023-11-06T18:17:10 with Matplotlib v3.8.1, https://matplotlib.org/]
diff --git a/docs/benchmarks-1106.md b/docs/benchmarks-1106.md
new file mode 100644
index 000000000..3f00da7e5
--- /dev/null
+++ b/docs/benchmarks-1106.md
@@ -0,0 +1,56 @@
+# GPT code editing benchmarks
+
+[![benchmark results](../assets/benchmarks-1106.svg)](https://aider.chat/assets/benchmarks-1106.svg)
+
+Aider is an open source command line chat tool that lets you work with GPT to edit
+code in your local git repo.
+To do this, aider needs to reliably recognize when GPT wants to edit local files,
+determine which files it wants to modify, and apply the changes it describes.
+
+Doing a good job on this "code editing" task requires a good LLM, good prompting, and
+a good tool driving the interactions with the LLM.
+Aider uses a
+[benchmark to quantitatively evaluate code editing](https://aider.chat/docs/benchmarks.html)
+whenever one of these things changes.
+For example,
+whenever I change aider's prompting or the backend which drives LLM conversations,
+I benchmark it to make sure these changes produce improvements (not regressions).
+
+The benchmark asks GPT to complete the
+[Exercism Python coding exercises](https://github.com/exercism/python).
+Exercism provides a starting Python file with stubs for the needed functions,
+a natural language description of the problem to solve,
+and a test suite to evaluate whether the coder has correctly solved the problem.
+
+The benchmark that aider uses gives GPT two tries to complete each task, as sketched in the code below:
+
+1. On the first try, GPT is given the stub code file to edit and the natural language instructions that describe the problem.
+2. If the tests fail after the first try, GPT gets to see the test suite error output and try to fix the code.
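+
+To make this two-try flow concrete, here is a minimal sketch of what such a
+harness might look like. It is an illustration, not aider's actual benchmark
+code: the `solve()` helper, the stub filename, and the use of pytest are all
+assumptions made for the example.
+
+```python
+import subprocess
+from pathlib import Path
+
+from openai import OpenAI  # assumes the openai v1 client
+
+client = OpenAI()
+
+
+def solve(prompt: str) -> str:
+    """Hypothetical helper: ask the model for a complete replacement file."""
+    resp = client.chat.completions.create(
+        model="gpt-4-1106-preview",
+        messages=[{"role": "user", "content": prompt}],
+    )
+    return resp.choices[0].message.content or ""
+
+
+def run_tests(exercise_dir: Path) -> subprocess.CompletedProcess:
+    # Run the exercise's test suite; using pytest is an assumption.
+    return subprocess.run(
+        ["pytest", str(exercise_dir)], capture_output=True, text=True
+    )
+
+
+def attempt_exercise(exercise_dir: Path, instructions: str) -> bool:
+    stub = exercise_dir / "solution.py"  # hypothetical stub filename
+
+    # Try 1: the stub code plus the natural language problem description.
+    stub.write_text(solve(f"{instructions}\n\n{stub.read_text()}"))
+    result = run_tests(exercise_dir)
+    if result.returncode == 0:
+        return True
+
+    # Try 2: show the model the failing test output and let it fix the code.
+    stub.write_text(solve(
+        f"{instructions}\n\n{stub.read_text()}\n\n"
+        f"The tests failed:\n{result.stdout}{result.stderr}"
+    ))
+    return run_tests(exercise_dir).returncode == 0
+```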
+
+[OpenAI just released new versions of GPT-3.5 and GPT-4](https://openai.com/blog/new-models-and-developer-products-announced-at-devday),
+and there's a lot
+of interest in how well they can code compared to the previous versions.
+With that in mind, I've been benchmarking the new models.
+
+## gpt-4-1106-preview
+
+- The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models! I won't be able to quantify this properly until the rate limits loosen up. Currently I am seeing roughly 10X faster responses.
+- **It is better at producing correct code on the first try.** It gets ~57% of the coding exercises correct without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
+- After being given a chance to fix bugs by reviewing test suite error output, the new model seems to perform similarly to the old models.
+
+**These results are preliminary.**
+OpenAI is enforcing very low
+rate limits on the new GPT-4 model. The limits are so low that
+I have only been able to attempt 53 of the 133 Exercism problems.
+They were chosen at random, so the results should be *roughly*
+indicative of the full benchmark.
+
+## gpt-3.5-turbo-1106
+
+- The new `gpt-3.5-turbo-1106` model completes the benchmark **3-4X faster** than the earlier GPT-3.5 models.
+- The overall success rate after the first and second tries seems comparable to the earlier models.
+
+## Updates
+
+I will continue to update the GPT-4 results as my rate limits allow.
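+
+As a footnote on methodology, the random subset and the pass rates quoted
+above are straightforward to compute. Here is a rough sketch; the `results`
+data shape, the exercise names, and the fixed seed are assumptions for
+illustration, not aider's actual code.
+
+```python
+import random
+
+# Hypothetical shape: exercise name -> (passed on try 1, passed on try 2).
+results = {
+    "anagram": (True, True),
+    "bowling": (False, True),
+    "zipper": (False, False),
+}
+
+# Pick a random subset when rate limits cap how many problems we can run.
+subset = random.Random(0).sample(sorted(results), k=2)
+print("subset:", subset)
+
+first_try = sum(p1 for p1, _ in results.values()) / len(results)
+overall = sum(p1 or p2 for p1, p2 in results.values()) / len(results)
+print(f"first try: {first_try:.0%}, after both tries: {overall:.0%}")
+```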