This commit is contained in:
Paul Gauthier 2023-06-30 13:36:25 -07:00
parent 2ee31fd251
commit 2ce5576aa9

@@ -215,16 +215,16 @@ usually on the order of 3-6 different variations. This feels
like they are load balancing across a number of different
instances of the model.
For some exercises, some of these variable responses pass the unit tests and other
responses do not.
Given that, it would be ideal to run all 133 exercises many times for each
model/edit-format combination and report an average performance.
This would average away the effect of the API variance.
That would also significantly increase the cost of this sort of benchmarking,
so I didn't do that.
Benchmarking against 133 exercises provides some robustness all by itself, since
we are measuring the performance across many exercises.
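The repeated-runs idea above could be sketched roughly as follows. This is a hypothetical illustration, not part of the benchmark harness: `pass_rates` and `average_pass_rate` are made-up names, and each value stands for the fraction of the 133 exercises solved in one repeated run of a single model/edit-format combination.

```python
import statistics

def average_pass_rate(pass_rates):
    """Return the mean pass rate across repeated runs, plus its
    sample standard deviation as a rough measure of API variance."""
    mean = statistics.mean(pass_rates)
    stdev = statistics.stdev(pass_rates) if len(pass_rates) > 1 else 0.0
    return mean, stdev

# Hypothetical example: three repeated runs of one configuration.
mean, stdev = average_pass_rate([0.55, 0.58, 0.53])
```

Reporting the mean smooths out run-to-run variance, while the standard deviation indicates how noisy a single run of the benchmark would be.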
But to get a sense of how much the API variance impacts the benchmark outcomes, But to get a sense of how much the API variance impacts the benchmark outcomes,