diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 60fa4947f..c659e4e28 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -56,8 +56,8 @@ More details on the benchmark, edit formats and results are discussed below.
 
 ## The benchmark
 
-The benchmark uses the 133
-[practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
+The benchmark uses
+[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
 They were designed for people to learn and practice
 their python coding skills.
 
@@ -186,7 +186,7 @@ format requests original/updated edits to be returned using the function call AP
 ## GPT-3.5 hallucinates function calls?
 
 GPT-3.5 was very prone to ignoring the JSON Schema that specified valid functions,
-and would often return a completely invalid `function_call` fragment with `name="python"`.
+and would often return a completely invalid `function_call` fragment with `"name": "python"`.
 
 ```
 "function_call": {
@@ -200,7 +200,7 @@ with the arguments to the function specified in the `name` field.
 
 Instead, gpt-3.5 frequently just stuffed the entire python program into that field.
 
-It feels like it is getting confused with training done for ChatGPT plugins?
+It feels like it might be getting confused by fine tuning that was done for ChatGPT plugins?
 
 ## Limitations
 
@@ -211,8 +211,9 @@ like they are load balancing across a number of slightly different instances of
 the model.
 
 For some exercises, some of these variable responses pass the unit tests while
-other variants do not. Whether the exercises passes is therefore
-a bit random, depending on which variant OpenAI returns.
+other variants do not. Results for exercises like this which are
+"on the bubble"
+are therefore a bit random, depending on which variant OpenAI returns.
 
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
@@ -224,7 +225,8 @@ Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the `gpt-3.5-turbo-0613 / whole` experiment 10 times.
+I ran all 133 exercises 10 times each
+against `gpt-3.5-turbo-0613` with the `whole` edit format.
 
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.