Mirror of https://github.com/Aider-AI/aider.git
Commit adf6929669 (parent 5e82455c85)
1 changed file with 29 additions and 18 deletions

@@ -7,11 +7,17 @@ Aider is an open source command line chat tool that lets you work with GPT to ed
 code in your local git repo.
 You can use aider to have GPT add features, write tests or make other changes to your code.
 
-The ability for GPT to reliably edit local source files is
-crucial for this functionality, and depends mainly on the "edit format".
-The edit format is an important component of the system prompt,
-which specifies how GPT should structure code edits in its
-responses.
+To do this, aider needs to be able to reliably recognize when GPT wants to edit local files,
+determine which files to modify and what edits to apply.
+Without this direct read/modify/write integration,
+users would need to repeatedly copy/paste
+code back and forth between their files and a ChatGPT window.
 
+Successful automated
+code editing hinges on the "edit format", which specifies
+how GPT should structure code edits in its responses.
+Aider instructs GPT to use a specific
+edit format as part of the system prompt.
+
 Aider currently uses simple text based editing formats, but
 [OpenAI's new function calling
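The edit format described above is the contract between aider and GPT: GPT structures its edits in an agreed-upon shape, and aider recognizes them and applies them to local files. As a rough sketch (not aider's actual prompt or parser), a hypothetical "whole file" style reply — a file path on its own line followed by a fenced code block with the complete new file — could be recognized and applied like this; the regex and the `apply_whole_file_edits` helper are assumptions made for illustration:

```python
import re
from pathlib import Path

FENCE = "`" * 3  # built programmatically to avoid nesting literal fences in this example

# Hypothetical GPT reply in a "whole file" style edit format: a file path on
# its own line, immediately followed by a fenced code block containing the
# complete updated contents of that file.
reply = (
    "hello.py\n"
    + FENCE + "python\n"
    + "def greet(name):\n"
    + '    print(f"hello, {name}")\n'
    + FENCE + "\n"
)

# A path line followed by a fenced block means "replace that file's contents".
PATTERN = re.compile(
    rf"^(\S+\.\w+)\n{FENCE}[^\n]*\n(.*?)^{FENCE}",
    re.DOTALL | re.MULTILINE,
)

def apply_whole_file_edits(text: str) -> None:
    """Write every whole-file edit found in a reply out to the local files."""
    for path, new_content in PATTERN.findall(text):
        Path(path).write_text(new_content)
        print(f"updated {path}")

apply_whole_file_edits(reply)  # writes hello.py with the new contents
```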
@@ -26,35 +32,40 @@ benchmark based on the [Exercism
 python](https://github.com/exercism/python) coding exercises.
 This
 benchmark evaluates how effectively aider and GPT can translate a
-natural language coding request into actual runnable code saved into
+natural language coding request into actual executable code saved into
 files that pass unit tests. It's an end-to-end evaluation of not just
-GPT's code writing ability, but also its capacity to *edit existing
-code* and *package those code changes* so that aider can save the
+GPT's code writing ability, but also its capacity to *edit existing code*
+and *format those code edits* so that aider can save the
 edits to the local source files.
 
 I ran this code editing benchmark
-on almost all the ChatGPT models, using a variety of edit formats.
+on all the ChatGPT models except `gpt-4-32k`, using a variety of edit formats.
 The results were quite interesting:
 
 - Asking GPT to return an updated copy of the whole file in a standard markdown fenced code block proved to be the most reliable and effective edit format across all GPT-3.5 and GPT-4 models. The results from this `whole` edit format are shown in solid blue in the graph.
-- Using the new functions API performed worse than the above whole file method for all models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results from these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
-- The performance of the new June (`0613`) version of GPT-3.5 appears to be a bit worse than the February (`0301`) version. This is visible if you look at the "first coding attempt" markers on the first three blue bars.
+- Using the new functions API for edits performed worse than the above whole file method for all models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results from these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
+- The performance of the new June (`0613`) versions of GPT-3.5 appears to be a bit worse than the February (`0301`) version. This is visible if you look at the "first coding attempt" markers on the first three blue bars and also by comparing the first three green `diff` bars.
 - As expected, the GPT-4 models outperformed the GPT-3.5 models in code editing.
 
 The quantitative benchmark results align with my intuitions
 about prompting GPT for complex tasks like coding. It's beneficial to
 minimize the "cognitive overhead" of formatting the response, allowing
-GPT to concentrate on the task at hand. As an analogy, imagine asking a junior
-developer to implement a new feature by manually typing the required
-code changes as `diff -c` formatted edits. You wouldn't expect a good result.
+GPT to concentrate on the task at hand.
+As an analogy, imagine a slack conversation with a junior developer where
+you ask them to give you the code for some new feature.
+Will they produce better code if you ask them to type out the
+code and wrap it in a triple-backtick block?
+Or if you ask them to encode their code edits within a properly escaped and
+syntactically correct
+json data structure?
 
-Using more complex output formats seems to introduce two issues:
+Using more complex output formats with GPT seems to introduce two issues:
 
-- It makes GPT write worse code. Keeping the output format simple appears to allow GPT to devote more attention to the actual coding task.
+- It makes GPT write worse code. Keeping the output format simple seems to allow GPT to devote more attention to the actual coding task.
 - It reduces GPT's adherence to the output format, making it more challenging for tools like aider to accurately identify and apply the edits GPT is attempting to make.
 
-I was planning to start using a function call based edit format in aider for both GPT-3.5 and GPT-4.
-But given these benchmarking results, I won't be adopting the functions API
+I was expecting to start using function call based edits in aider for both GPT-3.5 and GPT-4.
+But given these benchmark results, I won't be adopting the functions API
 at this time.
 
 More details on the benchmark, edit formats and results are discussed below.
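By contrast, the `...-func` edit formats in the results above ask GPT to return its edits through OpenAI's function calling API rather than as plain text. A minimal sketch of such a request, using the pre-1.0 `openai` Python package that exposed the `functions` parameter, might look like the following; the `write_file` function and its schema are illustrative assumptions, not aider's actual definitions:

```python
import json

import openai  # pre-1.0 openai package; assumes OPENAI_API_KEY is set in the environment

# Illustrative function schema asking GPT to return a whole updated file as
# structured arguments instead of markdown text. Not aider's exact schema.
WRITE_FILE_FUNCTION = {
    "name": "write_file",
    "description": "Replace the full contents of a source file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Relative path of the file to write."},
            "content": {"type": "string", "description": "Complete new contents of the file."},
        },
        "required": ["path", "content"],
    },
}

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {"role": "system", "content": "Edit the user's files by calling write_file."},
        {"role": "user", "content": "Add a greet() function to hello.py."},
    ],
    functions=[WRITE_FILE_FUNCTION],
    function_call={"name": "write_file"},  # force a structured edit
)

call = response.choices[0].message.get("function_call")
if call:
    # The arguments arrive as a JSON string that still has to parse cleanly --
    # the extra escaping burden discussed above.
    args = json.loads(call["arguments"])
    print(args["path"], "->", len(args["content"]), "characters of new content")
```

Even when the function call is forced, the file contents must come back as one large escaped JSON string, which is the kind of extra formatting burden the post argues hurts both the code quality and GPT's adherence to the format.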