This commit is contained in:
Paul Gauthier 2023-06-30 14:11:26 -07:00
parent 108328b4be
commit ae9df00043

@@ -56,8 +56,8 @@ More details on the benchmark, edit formats and results are discussed below.
 ## The benchmark
-The benchmark uses the 133
-[practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
+The benchmark uses
+[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
 They were designed for people to learn and practice
 their python coding skills.
@@ -186,7 +186,7 @@ format requests original/updated edits to be returned using the function call AP
 ## GPT-3.5 hallucinates function calls?
 GPT-3.5 was very prone to ignoring the JSON Schema that specified valid functions,
-and would often return a completely invalid `function_call` fragment with `name="python"`.
+and would often return a completely invalid `function_call` fragment with `"name": "python"`.
 ```
 "function_call": {
@@ -200,7 +200,7 @@ with the arguments to the function specified in the `name` field.
 Instead, gpt-3.5 frequently just stuffed the entire python
 program into that field.
-It feels like it is getting confused with training done for ChatGPT plugins?
+It feels like it might be getting confused by fine tuning that was done for ChatGPT plugins?
 ## Limitations
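
As an aside, here is a minimal sketch of what detecting this failure mode might look like in Python. The `DECLARED_FUNCTIONS` set and `classify_function_call` helper are hypothetical illustrations, not aider's actual handling code; only the `function_call` / `"name": "python"` message shape comes from the text above.

```python
# Hypothetical set of function names actually declared in the request's JSON
# Schema. Aider's real "func" edit formats declare their own names; this is
# just for illustration.
DECLARED_FUNCTIONS = {"write_file"}


def classify_function_call(message):
    """Classify an assistant message from the chat completions API.

    Returns "no_call", "valid_call", or "hallucinated_call", where a
    hallucinated call is one whose name (often "python") was never
    declared in the request's function schema.
    """
    call = message.get("function_call")
    if call is None:
        return "no_call"
    if call.get("name") in DECLARED_FUNCTIONS:
        return "valid_call"
    return "hallucinated_call"


# A response fragment of the shape described above, with a whole python
# program stuffed into the call instead of proper JSON arguments.
hallucinated = {
    "role": "assistant",
    "function_call": {
        "name": "python",
        "arguments": "def hello():\n    print('Hello, World!')\n",
    },
}

print(classify_function_call(hallucinated))  # -> hallucinated_call
```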
@@ -211,8 +211,9 @@ like they are load balancing across a number of slightly different
 instances of the model.
 For some exercises, some of these variable responses pass the unit tests while
-other variants do not. Whether the exercises passes is therefore
-a bit random, depending on which variant OpenAI returns.
+other variants do not. Results for exercises like this which are
+"on the bubble"
+are therefore a bit random, depending on which variant OpenAI returns.
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
@@ -224,7 +225,8 @@ Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the `gpt-3.5-turbo-0613 / whole` experiment 10 times.
+I ran all 133 exercises 10 times each
+against `gpt-3.5-turbo-0613` with the `whole` edit format.
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.
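
For context on how error bars like those could be derived, here is a minimal sketch. The pass counts below are invented placeholders, not the benchmark's real numbers; only the mean and min/max-over-runs computation reflects the repeated-runs setup described above.

```python
import statistics

# Hypothetical pass counts (out of 133 exercises) from 10 repeated runs of
# gpt-3.5-turbo-0613 with the "whole" edit format. These values are made up
# purely for illustration.
passes_per_run = [76, 78, 75, 77, 79, 76, 74, 78, 77, 76]

pass_rates = [p / 133 for p in passes_per_run]

mean_rate = statistics.mean(pass_rates)
low, high = min(pass_rates), max(pass_rates)

# The error bars span the min/max range of pass rates across the runs.
print(f"mean pass rate: {mean_rate:.1%}")
print(f"error bar range: {low:.1%} .. {high:.1%}")
```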