diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 45f3a2df1..0f34d2971 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -43,7 +43,7 @@ The results were quite interesting:
 - Using the new function calling API performed worse than the above whole file method for all models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results from these `func` edit methods are shown as patterned bars in the graph (both green and blue).
 - As expected, the GPT-4 models outperformed the GPT-3.5 models in code editing.
 
-The quantitative benchmark results align with my developing intuition
+The quantitative benchmark results align with my intuitions
 about prompting GPT for complex tasks like coding. It's beneficial to
 minimize the "cognitive overhead" of formatting the response, allowing
 GPT to concentrate on the task at hand. As an analogy, asking a junior