Paul Gauthier 2023-07-02 08:06:17 -07:00
parent cef990cd98
commit 93e29eda94

@@ -5,28 +5,20 @@
Aider is an open source command line chat tool that lets you work with GPT to edit
code in your local git repo.
You can use aider to have GPT add features, write tests or make other changes to your code.
To do this, aider needs to be able to reliably recognize when GPT wants to edit local files,
determine which files it wants to modify and what changes to save.
Such automated
code editing hinges on the "edit format" portion of the system prompt, which specifies
how GPT should structure code edits in its responses.
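
To make the idea of an edit format concrete, here is a hedged sketch of the kind of reply the simplest text based format asks for: the complete updated file inside a standard markdown fenced code block. The file name, its contents, and the convention of putting the path above the fence are assumptions made for illustration, not aider's exact prompt or reply.

````
greeting.py
```python
def greet(name):
    # the reply contains the entire updated file, not just the changed lines
    return f"Hello, {name}!"
```
````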

Aider currently uses simple text based editing formats, but
[OpenAI's new function calling
API](https://openai.com/blog/function-calling-and-other-api-updates)
looks like a promising way to create more structured edit formats.
I wanted
a quantitative way to assess the potential benefits
of switching aider to function based editing.
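
As a rough sketch of what "function based editing" could mean with the June 2023 functions API (using the OpenAI Python library of that era), a tool can declare a hypothetical `write_file` function and ask GPT to return its edit as structured JSON arguments instead of free text. The function name and schema below are illustrative assumptions, not aider's actual edit format.

```python
import openai  # 2023-era openai-python interface

# Hypothetical function schema: ask GPT to return an edit as structured
# arguments (a path plus the complete new file contents).
functions = [
    {
        "name": "write_file",
        "description": "Save an updated copy of one source file.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "file to modify"},
                "content": {"type": "string", "description": "complete new file contents"},
            },
            "required": ["path", "content"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Add a greet() function to greeting.py"}],
    functions=functions,
    function_call={"name": "write_file"},  # force a structured reply
)

# The edit arrives as a JSON string of arguments rather than plain text,
# which the calling tool must parse and apply.
print(response.choices[0].message["function_call"]["arguments"])
```

The appeal is that a reply like this is machine readable by construction; the benchmark below measures whether that structure actually helps.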

With this in mind, I developed a
benchmark based on the [Exercism
python](https://github.com/exercism/python) coding exercises.
This
benchmark evaluates how effectively aider and GPT can translate a
natural language coding request into executable code saved into
files that pass unit tests.
It provides an end-to-end evaluation of not just
GPT's coding ability, but also its capacity to *edit existing code*
and *format those code edits* so that aider can save the
edits to the local source files.
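
As an illustration of what such an end-to-end check involves, the sketch below solves one exercise by handing its natural language instructions to the chat tool and then running the exercise's unit tests. The directory layout, file names, and command line flags are assumptions made for illustration, not the benchmark's actual harness.

```python
import subprocess
from pathlib import Path

def run_exercise(exercise_dir: Path) -> bool:
    """Solve one exercise end to end and report whether its unit tests pass."""
    # Hypothetical layout: each exercise ships instructions plus a stub and tests.
    instructions = (exercise_dir / "instructions.md").read_text()

    # Let the chat tool edit the stub files in place from the natural language request.
    subprocess.run(
        ["aider", "--yes", "--message", instructions],
        cwd=exercise_dir,
        check=False,
    )

    # The exercise counts as solved only if its tests pass after the edit.
    result = subprocess.run(["pytest", "-q"], cwd=exercise_dir)
    return result.returncode == 0

if __name__ == "__main__":
    dirs = [d for d in sorted(Path("exercises").iterdir()) if d.is_dir()]
    print(sum(run_exercise(d) for d in dirs), "of", len(dirs), "exercises passed")
```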

I ran the benchmark
on all the ChatGPT models (except `gpt-4-32k`), using a variety of edit formats.
The results were interesting:

- Asking GPT to return an updated copy of the whole file in a standard markdown fenced code block proved to be the most reliable and effective edit format across all GPT-3.5 and GPT-4 models. The results for this `whole` edit format are shown in solid blue in the graph.
- Using the new functions API for edits performed worse than the above whole file method, for all the models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results for these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
@@ -62,7 +55,7 @@
Or should they type up a properly escaped and
syntactically correct json data structure
that contains the text of the new code?

Using more complex output formats with GPT seems to cause two issues:

- It makes GPT write worse code. Keeping the output format simple seems to allow GPT to devote more attention to the actual coding task.
- It reduces GPT's adherence to the output format, making it more challenging for tools like aider to accurately identify and apply the edits GPT is attempting to make.
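
To make the escaping question above concrete, here is a small illustration (not taken from the benchmark) of the same two-line function written as plain source text versus as the string GPT must produce inside a JSON argument.

```python
import json

code = 'def greet(name):\n    return f"Hello, {name}!"\n'

# As plain text in a fenced code block, GPT just writes the code.
print(code)

# Inside a JSON argument, every newline and quote must be escaped correctly:
# "def greet(name):\n    return f\"Hello, {name}!\"\n"
print(json.dumps(code))
```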