Paul Gauthier 2023-06-30 14:36:35 -07:00
parent 8c73a7be35
commit c2b1bc7e07

@@ -225,7 +225,7 @@ Benchmarking against 133 exercises provides some robustness all by itself, since
we are measuring the performance across many exercises.
But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the all 133 exercises 10 times each
+I ran all 133 exercises 10 times each
against `gpt-3.5-turbo-0613` with the `whole` edit format.
You'll see one set of error bars in the graph, which demark
the range of results across those 10 runs.