diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 635e9e102..b77aa9751 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -217,20 +217,21 @@ usually on the order of 3-6 different variations.
 This feels like they are load balancing across a number of
 slightly different instances of the model.
 
-For some exercises, some of this variable responses pass the unit tests and other
-responses do not.
+For some exercises, some of these variable responses pass the unit tests while
+other variants do not. Whether the exercises passes is therefore
+a bit random, depending on which variant OpenAI returns.
 
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
 This would average away the effect of the API variance.
-That would also significantly increase the cost of this sort of benchmarking,
+It would also significantly increase the cost of this sort of benchmarking,
 so I didn't do that.
 
 Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the `gpt-3.5-turbo-0613 + whole` experiment 10 times.
+I ran the `gpt-3.5-turbo-0613 / whole` experiment 10 times.
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.
 
@@ -239,7 +240,7 @@ contribute to a large variance in the benchmark results.
 
 ## Conclusions
 
-Based on these benchmarking results, aider will continue to usea
+Based on these benchmarking results, aider will continue to use
 `whole` for gpt-3.5 and `diff` for gpt-4.
 While `gpt-4` gets slightly better results with the `whole` edit format,
 it significantly increases costs and latency compared to `diff`.
@@ -247,4 +248,3 @@ Since `gpt-4` is already costly and slow, this seems like
 an acceptable tradeoff.
 
 
-