Paul Gauthier 2023-06-30 14:11:26 -07:00
parent 108328b4be
commit ae9df00043

@@ -56,8 +56,8 @@ More details on the benchmark, edit formats and results are discussed below.
 ## The benchmark
-The benchmark uses the 133
-[practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
+The benchmark uses
+[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
 They were designed for people to learn and practice
 their python coding skills.
@@ -186,7 +186,7 @@ format requests original/updated edits to be returned using the function call API.
 ## GPT-3.5 hallucinates function calls?
 GPT-3.5 was very prone to ignoring the JSON Schema that specified valid functions,
-and would often return a completely invalid `function_call` fragment with `name="python"`.
+and would often return a completely invalid `function_call` fragment with `"name": "python"`.
 ```
 "function_call": {
@@ -200,7 +200,7 @@ with the arguments to the function specified in the `name` field.
 Instead, gpt-3.5 frequently just stuffed the entire python
 program into that field.
-It feels like it is getting confused with training done for ChatGPT plugins?
+It feels like it might be getting confused by fine tuning that was done for ChatGPT plugins?
 ## Limitations
@@ -211,8 +211,9 @@ like they are load balancing across a number of slightly different
 instances of the model.
 For some exercises, some of these variable responses pass the unit tests while
-other variants do not. Whether the exercises passes is therefore
-a bit random, depending on which variant OpenAI returns.
+other variants do not. Results for exercises like this which are
+"on the bubble"
+are therefore a bit random, depending on which variant OpenAI returns.
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
@@ -224,7 +225,8 @@ Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the `gpt-3.5-turbo-0613 / whole` experiment 10 times.
+I ran all 133 exercises 10 times each
+against `gpt-3.5-turbo-0613` with the `whole` edit format.
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.
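As a minimal sketch of how such an average and error-bar range could be summarized (this is not aider's benchmark harness; the helper below is hypothetical and takes per-run pass counts supplied by the caller):

```
def summarize_runs(pass_counts, total=133):
    """Summarize repeated benchmark runs as the mean percent of exercises
    solved plus the min/max range across runs (the "error bars" above).

    `pass_counts` is a list with the number of exercises solved in each run.
    """
    runs = len(pass_counts)
    mean_pct = 100 * sum(pass_counts) / (runs * total)
    low_pct = 100 * min(pass_counts) / total
    high_pct = 100 * max(pass_counts) / total
    return mean_pct, (low_pct, high_pct)
```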