From 09a220f7fbd3a4b79e64a291fbb695e70397a55f Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Sat, 1 Jul 2023 09:49:01 -0700
Subject: [PATCH] copy

---
 docs/benchmarks.md | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index cf851b0ef..ba5c7c59b 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -18,8 +18,8 @@ specifying how GPT should format code edits in its replies.
 Different edit formats can range in complexity from
 something simple like "return an updated copy of the whole file"
 to a much more sophisticated format
-that uses the
-[function calling API](https://openai.com/blog/function-calling-and-other-api-updates)
+that uses
+[OpenAI's new function calling API](https://openai.com/blog/function-calling-and-other-api-updates)
 to specify a series of specific diffs
 
 To measure the impact of changes to the edit format,
@@ -130,7 +130,7 @@ Sometimes it just writes the wrong code.
 Other times, it fails to format the code edits
 in a way that conforms to the edit format
 so the code isn't saved properly.
-It's worth keeping in mind that changing the edit format often affects both aspects of GPT's performance on the exercises.
+It's worth keeping in mind that changing the edit format often affects both aspects of GPT's performance.
 Complex edit formats often make it write worse code
 *and* make it less successful
 at formatting the edits correctly.
@@ -170,13 +170,6 @@ Each edit is a fenced code block that
 specifies the filename and a chunk of ORIGINAL and UPDATED code.
 GPT provides some original lines from the file
 and then a new updated set of lines.
-While GPT-3.5 is sometimes able to generate this `diff` edit format,
-it often uses it in a pathological way.
-It puts the *entire* original source file in the ORIGINAL block
-and the entire updated file in the UPDATED block.
-This is strictly worse than just using the `whole` edit format,
-since GPT is sending 2 full copies of the file.
-
 ````
 Here are the changes you requested to demo.py:
 
@@ -231,10 +224,28 @@ original/updated style edits to be returned using the function call API.
 }
 ```
 
-## GPT-3.5 hallucinates function calls
+## GPT-3.5 struggles with complex edit formats
 
-GPT-3.5 is prone to ignoring the JSON Schema that specifies valid functions,
-and often returns a completely novel and semantically
+While GPT-3.5 is able to pass some exercises using
+edit formats other than the `whole` format,
+it really struggles with the rest of the formats.
+
+### Pathological use of `diff`
+
+While GPT-3.5 is sometimes able to
+correctly generate the `diff` edit format,
+it often uses it in a pathological way.
+
+It places the *entire* original source file in the ORIGINAL block
+and the entire updated file in the UPDATED block.
+This is strictly worse than just using the `whole` edit format,
+since GPT is sending 2 full copies of the file.
+
+### Hallucinating function calls
+
+When using the functions API,
+GPT-3.5 is prone to ignoring the JSON Schema that specifies valid functions.
+It often returns a completely novel and semantically
 invalid `function_call` fragment with `"name": "python"`.
 
 ```
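For concreteness, here is a sketch of the difference the patch is describing between a schema-conforming function call reply and the hallucinated one. The `replace_lines` name and its argument shape are hypothetical stand-ins; the actual function schema aider registers is not shown in this diff.

```python
import json

# Sketch only: "replace_lines" and its argument shape are hypothetical,
# standing in for whatever function the harness declares via JSON Schema.
valid_reply = {
    "function_call": {
        "name": "replace_lines",  # a function actually declared in the request
        "arguments": json.dumps({
            "original_lines": "def greet():\n    pass\n",
            "updated_lines": "def greet(name):\n    print(f\"Hello, {name}!\")\n",
        }),
    }
}

# The pathological GPT-3.5 reply: "python" was never declared as a function,
# and raw source code is stuffed into the fragment instead of
# schema-valid JSON arguments.
hallucinated_reply = {
    "function_call": {
        "name": "python",
        "arguments": 'def greet(name):\n    print(f"Hello, {name}!")\n',
    }
}
```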
@@ -249,7 +260,8 @@ with the arguments to the function specified in the `name` field.
 Instead, GPT-3.5 frequently just stuffs an
 entire python file into that field.
 
-It feels like it might be getting confused by fine tuning that was done for ChatGPT plugins?
+It feels like it might be getting confused by fine-tuning that was done
+for the ChatGPT code interpreter plugin?
 
 ## Randomness
 
@@ -260,7 +272,7 @@ when sending test error output to GPT
 it removes the wall-clock timing information
 that is normally included by the `unittest` module.
 
-The benchmarking harness also logs sha hashes of the
+The benchmarking harness also logs SHA hashes of all the
 OpenAI API requests and replies.
 This makes it possible to detect
 randomness or nondeterminism
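A minimal sketch of the request/reply hashing idea from that last hunk follows. The real harness's logging code is not part of this patch, so the helper name and the choice of SHA-1 here are assumptions.

```python
import hashlib
import json

def log_sha(label: str, payload: dict) -> str:
    """Log a short fingerprint of an API request or reply."""
    # Deterministic serialization: identical payloads always hash identically.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(blob).hexdigest()
    print(f"{label}: {digest}")
    return digest

# With temperature 0 the request hash is identical across runs, so any
# change in the *reply* hash exposes nondeterminism on the API side.
request = {
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "messages": [{"role": "user", "content": "Fix the failing test."}],
}
log_sha("request", request)
```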