From 8f73f8b6519f6e4156550aff53b67463340b96dd Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Sat, 1 Jul 2023 13:02:30 -0700
Subject: [PATCH] copy

---
 docs/benchmarks.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index ba5c7c59b..4f8dfa56d 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -56,11 +56,13 @@
 changes as `diff -c` formatted edits.
 Using more complex output formats seems to cause two problems:
 
  - It makes GPT write worse code. Keeping the output format simple seems to leave GPT with more attention to devote to the actual coding task.
- - It makes GPT less likely to adhere to the output format. This makes it harder for tooling to correctly identify and apply the edits it is trying to make.
+ - It makes GPT less likely to adhere to the output format. This makes it harder for tooling like aider to correctly identify and apply the edits GPT is trying to make.
 
 I had hoped that the new function calling API would enable more reliable use of
-structured output formats, but it does not appear to be a panacea
-when working with source code.
+structured output formats, and expected to switch aider to using it
+for both GPT-3.5 and GPT-4.
+But given these benchmarking results, I won't be adopting the functions api
+at this time.
 
 More details on the benchmark, edit formats and results are discussed below.
@@ -116,8 +118,10 @@
 Many of the exercises have multiple paragraphs of instructions,
 and most human coders would likely fail some tests on their first try.
 
-It's worth noting that GPT never gets to see the source code of the unit tests.
+It's worth noting that GPT never gets to see the source code of the unit tests
+during the benchmarking.
 Just the error output from failed tests.
+Of course, all of this code was probably part of its original training data!
 
 In summary, passing an exercise means GPT was able to:
 
@@ -261,7 +265,7 @@
 Instead, GPT-3.5 frequently just stuffs
 an entire python file into that field.
 It feels like it might be getting confused by fine tuning that was done
-for the ChatGPT coder interpreter plugin?
+for the ChatGPT code interpreter plugin?
 
 ## Randomness
 