Paul Gauthier 2023-06-30 14:33:45 -07:00
parent 6610a8310c
commit 8c73a7be35

@@ -35,8 +35,8 @@ I ran the benchmark
on almost all the ChatGPT models, using a variety of edit formats.
This produced some interesting observations:
-- Asking GPT to just return an updated copy of the whole file in a normal fenced code block is by far the most reliable way to have it edit code. This is true across all gpt-3.5 and gpt-4 models. Keeping the output format dead simple seems to leave GPT with more brain power to devote to the actual coding task. GPT is also less likely to mangle this simple output format.
-- Using the new function calling API is worse than returning whole files in markdown. GPT writes worse code and frequently mangles the output format, even though OpenAI introduced the function calling API to make structured output formatting more reliable. This was a big surprise.
+- Asking GPT to just return an updated copy of the whole file in a normal fenced code block is by far the most reliable edit format. This is true across all gpt-3.5 and gpt-4 models. Keeping the output format dead simple seems to leave GPT with more brain power to devote to the actual coding task. GPT is also less likely to mangle this simple output format.
+- Using the new function calling API is worse than the above whole file method, for all models. GPT writes worse code and frequently mangles this output format, even though OpenAI introduced the function calling API to make structured output formatting more reliable. This was a big surprise.
- The new June (`0613`) versions of `gpt-3.5-turbo` are worse at code editing than the older Feb (`0301`) version. This was unexpected.
- The gpt-4 models are much better at code editing than the gpt-3.5 models. This was expected.
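
To make the two edit formats in the list above concrete, here is a minimal sketch of how each request might be made. This is not aider's actual implementation: it assumes the 2023-era `openai` Python package (`openai.ChatCompletion.create`) and uses a hypothetical `write_file` function schema purely for illustration.

```python
# Sketch only: contrasts the "whole file in a fenced code block" edit format
# with the function calling API. Assumes OPENAI_API_KEY is set in the
# environment and the 2023-era openai 0.x package is installed.
import json
import re

import openai


def edit_whole_file(source_code, instruction, model="gpt-3.5-turbo-0301"):
    """Ask for the entire updated file inside a single fenced code block."""
    messages = [
        {"role": "system",
         "content": "Return the complete updated file in one fenced code block."},
        {"role": "user",
         "content": f"{instruction}\n\n```python\n{source_code}\n```"},
    ]
    response = openai.ChatCompletion.create(model=model, messages=messages)
    reply = response.choices[0].message["content"]
    # Pull the updated file back out of the fenced block.
    match = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply


def edit_via_function_call(source_code, instruction, model="gpt-3.5-turbo-0613"):
    """Ask for the updated file through the function calling API instead."""
    functions = [{
        "name": "write_file",  # hypothetical schema, for illustration only
        "description": "Write the complete updated file.",
        "parameters": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    }]
    messages = [{"role": "user",
                 "content": f"{instruction}\n\n```python\n{source_code}\n```"}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        functions=functions,
        function_call={"name": "write_file"},
    )
    # The arguments come back as a JSON string containing the file content.
    args = response.choices[0].message["function_call"]["arguments"]
    return json.loads(args)["content"]
```

The whole file variant only needs a regex to recover the result, while the function calling variant depends on GPT emitting well-formed JSON for the entire file, which is one place the format can get mangled.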