Paul Gauthier 2023-07-01 17:00:57 -07:00
parent 3f0c752368
commit 7db384fc63


@@ -41,6 +41,7 @@ The results were quite interesting:
- Asking GPT to return an updated copy of the whole file in a standard markdown fenced code block proved to be the most reliable and effective edit format across all GPT-3.5 and GPT-4 models. The results from this `whole` edit format are shown in solid blue in the graph. (A sketch of this reply format appears just after this list.)
- Using the new function calling API performed worse than the above whole file method for all models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results from these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
- The performance of the June (`0613`) version of GPT-3.5 appears to be a bit worse than the Feb (`0301`) version. This is visible if you look at the "first coding attempt" markers on the blue bars.
- As expected, the GPT-4 models outperformed the GPT-3.5 models in code editing.
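
For reference, here is a rough sketch of what a `whole` format reply looks like: the file path on a line by itself, followed by a fenced code block containing the complete updated file. The filename and code below are invented for illustration and are not taken from the benchmark itself.

````
greet.py
```python
def greet(name):
    # GPT re-emits every line of the file, even the unchanged ones
    print(f"Hello, {name}!")


greet("World")
```
````
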
The quantitative benchmark results align with my intuitions
@@ -226,27 +227,42 @@ original/updated style edits to be returned using the function call API.
}
```
## GPT-3.5's performance
### The `0613` models seem worse?
The benchmark results have me fairly convinced that the new
`gpt-3.5-turbo-0613` and `gpt-3.5-turbo-16k-0613` models
are a bit worse at code editing than
the older `gpt-3.5-turbo-0301` model.
This is especially visible if you look at just the "first coding attempt"
portion of each result, before GPT gets a second chance to edit the code.
Performance with the `whole` edit format was 46% for the
February model and only 39% for the June models.
I saw other signs of this degraded performance
in earlier versions of the
benchmark as well.
### Pathological use of `diff`
When GPT-3.5 is able to correctly generate the `diff` edit format,
it often uses it in a pathological manner. It places the *entire*
original source file in the ORIGINAL block and the entire updated file
in the UPDATED block. This is strictly worse than just using the
`whole` edit format, as GPT is sending 2 full copies of the file.
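
To illustrate the failure mode, here is a made-up sketch of such a response. The file, code, and exact block markers are illustrative only (the benchmark's real prompts and fencing details may differ); the point is that both blocks contain the whole file even though only one line actually changed:

```
greet.py
<<<<<<< ORIGINAL
def greet(name):
    print("Hello, " + name)

greet("World")
=======
def greet(name):
    print(f"Hello, {name}!")

greet("World")
>>>>>>> UPDATED
```

A well-behaved `diff` edit would include only the changed `print` line in each block.
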
### Hallucinated function calls
When GPT-3.5 uses the functions API,
it is prone to ignoring the JSON Schema that specifies valid functions.
It often returns a completely novel and semantically
invalid `function_call` fragment with `"name": "python"`.
The `arguments` attribute is supposed to be a set of key/value pairs
with the arguments to the function specified in the `name` field.
Instead, GPT-3.5 frequently just stuffs an entire python
file into that field.
```
"function_call": {
  "name": "python",
@@ -254,15 +270,12 @@ invalid `function_call` fragment with `"name": "python"`.
},
```
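
For contrast, a schema-conforming response would name one of the functions that was actually offered and pass `arguments` as a JSON-encoded object matching that function's parameters. The function name and argument keys below are hypothetical placeholders, not aider's actual schema:

```
"function_call": {
  "name": "replace_text",
  "arguments": "{\"path\": \"greet.py\", \"original\": \"print('hi')\", \"updated\": \"print('hello')\"}"
},
```
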
It seems like it might be getting confused by fine-tuning that was
done for the ChatGPT code interpreter plugin?
## Randomness
The benchmark attempts to be deterministic, always sending identical