Paul Gauthier 2023-06-30 14:36:35 -07:00
parent 8c73a7be35
commit c2b1bc7e07

@@ -225,7 +225,7 @@ Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the all 133 exercises 10 times each
+I ran all 133 exercises 10 times each
 against `gpt-3.5-turbo-0613` with the `whole` edit format.
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.