diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 29ce4f1df..286c09ca3 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -215,16 +215,16 @@ usually on the order of 3-6 different variations.
 This feels like they are load balancing across a number of
 different instances of the model.
 
-For some exercises, some responses pass the unit tests and other
-responses don't.
+For some exercises, some of these variable responses pass the unit tests and other
+responses do not.
 
 Given that, it would be ideal to run all 133 exercises many times for each
-model + edit format combination and report an average performance.
+model/edit-format combination and report an average performance.
 This would average away the effect of the API variance.
 That would also significantly increase the cost of this sort of benchmarking,
 so I didn't do that.
 
-Running 133 test cases provides some robustness all by itself, since
+Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 
 But to get a sense of how much the API variance impacts the benchmark outcomes,