Paul Gauthier 2023-12-17 13:36:19 -08:00
![robot flowchart](../assets/udiffs.jpg)

Aider now asks GPT-4 Turbo to use
[unified diffs](https://www.gnu.org/software/diffutils/manual/html_node/Example-Unified.html)
to edit your code when you request new features, improvements, bug fixes, test cases, etc.
This new support for unified diffs massively reduces GPT-4 Turbo's bad habit of "lazy" coding.
There are abundant anecdotes
about GPT-4 Turbo writing half-completed code filled with comments that give
homework assignments to the user,
like "...omitted for brevity..." or "...add logic here...".
Aider's new unified diff edit format significantly reduces this sort of lazy coding,
as quantified by dramatically improved scores
on a new "laziness benchmark".

Before trying to reduce laziness, I needed a way to quantify and measure
the problem.
This new laziness benchmark produced the following results with `gpt-4-1106-preview`:

- **A system prompt based on widely circulated folklore only scored 15%, the same as the baseline.** This experiment used the existing "SEARCH/REPLACE block" format with an additional prompt that claims the user is blind, has no hands, will tip $2000 and has suffered from "truncated code trauma".

The older `gpt-4-0613` also did better on the laziness benchmark by using unified diffs.
The benchmark was designed to work with large source code files, and
many of them are too large to use with June GPT-4.
**About 28% of the tasks exhausted the 8k context window** and were automatically
marked as failures,
significantly dragging down June GPT-4's performance on the benchmark.

- **June GPT-4's baseline was 26%** using aider's existing "SEARCH/REPLACE block" edit format.
- **Aider's new unified diff edit format raised June GPT-4's score to 59%**.

Before settling on unified diffs,
I explored many other approaches to stop GPT-4 Turbo from eliding code
and replacing it with comments.
These efforts included prompts about being tireless and diligent,
use of OpenAI's function/tool calling capabilities and numerous variations on
aider's existing editing formats, line number based formats and other diff-like formats.
The results shared here reflect
an extensive investigation of possible solutions and
a large number of benchmarking runs of numerous varied approaches against
GPT-4 Turbo.
referencing old code like
"...copy $USD formatting code here...".

Based on this observation, I set out to build a benchmark based on refactoring
a non-trivial amount of code found in fairly large source files.

To do this, I used Python's `ast` module to analyze the
[Django repository](https://github.com/django/django).

The goal was to search the Django repository to:

- Find source files that contain class methods which are non-trivial, having more than 100 AST nodes in their implementation.
- Focus on methods that are part of a larger class. We want to find methods which are less than half the code present in their containing class.
- Find methods that do not make any use of their `self` parameter. This means they can be trivially refactored out of the class and turned into a stand-alone top-level function.
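That search can be sketched with Python's `ast` module. This is an illustrative approximation of the criteria, not the benchmark's actual code; the function name and node-counting details are assumptions.

```python
import ast

def non_trivial_selfless_methods(source):
    """Yield (class, method) name pairs for methods that have more than
    100 AST nodes, are less than half of their containing class, and
    never reference their `self` parameter."""
    tree = ast.parse(source)
    for cls in ast.walk(tree):
        if not isinstance(cls, ast.ClassDef):
            continue
        class_size = sum(1 for _ in ast.walk(cls))
        for item in cls.body:
            if not isinstance(item, ast.FunctionDef):
                continue
            size = sum(1 for _ in ast.walk(item))
            # A method that never reads `self` can become a top-level function
            uses_self = any(
                isinstance(node, ast.Name) and node.id == "self"
                for node in ast.walk(item)
            )
            if size > 100 and size < class_size / 2 and not uses_self:
                yield cls.name, item.name
```

Running this over each Python file in the repository surfaces the candidate methods.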
We can then turn each of these source files into a task for the benchmark,
And it correlates well with other laziness metrics
gathered during benchmarking, like the
introduction of new comments that contain "...".

The result is a pragmatic
[benchmark suite that provokes, detects and quantifies GPT coding laziness](https://github.com/paul-gauthier/refactor-benchmark).
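That comment-based laziness signal can be sketched in a few lines. This is a hypothetical illustration of the idea, not the benchmark's actual implementation:

```python
import re

# Added comment lines containing "..." are a telltale sign of elided
# code, e.g. "# ... omitted for brevity ..." (illustrative heuristic).
LAZY_COMMENT = re.compile(r"#.*\.\.\.")

def count_lazy_comments(diff_lines):
    """Count lines added by a diff (leading '+') whose comment contains '...'."""
    return sum(
        1
        for line in diff_lines
        if line.startswith("+") and LAZY_COMMENT.search(line)
    )
```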
## Unified diff editing format
GPT-4 code editing format:

- HIGH LEVEL - Encourage GPT to structure edits as new versions of substantive code blocks (functions, methods, etc), not as a series of surgical/minimal changes to individual lines of code.
- FLEXIBLE - Strive to be maximally flexible when interpreting GPT's edit instructions.

A helpful shortcut here is to have empathy for GPT, and imagine you
are the one being tasked with specifying code edits.
Would you want to hand type a properly escaped JSON data structure
to specify surgical insert, delete, replace operations on specific code line numbers?
Would you want
to trigger an error and be forced to start over
after any typo, off-by-one line number or flubbed escape character?
GPT is quantitatively better at code editing when you reduce the
burden of formatting edits by using a familiar, simple, high level
They need to *accurately* reflect the original and updated file contents,
otherwise the patch command will fail to apply the changes.

Having GPT specify changes in a well-known format that is usually consumed by a
fairly rigid program like patch
seems to encourage rigor.
GPT is less likely to
leave informal editing instructions in comments
or be lazy about writing all the needed code.
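One flexible way to apply such a diff, sketched here as an assumption about the general approach rather than aider's exact code, is to rebuild the hunk's "before" and "after" texts from its body lines and apply the edit as a search and replace:

```python
def hunk_to_before_after(hunk_lines):
    """Split a hunk's body lines into 'before' and 'after' text.

    Context lines (leading space) appear in both; '-' lines only in
    the before text, '+' lines only in the after text.
    """
    before, after = [], []
    for line in hunk_lines:
        marker, rest = line[0], line[1:]
        if marker in " -":
            before.append(rest)
        if marker in " +":
            after.append(rest)
    return "".join(before), "".join(after)

def apply_hunk(content, hunk_lines):
    """Apply one hunk by replacing its 'before' text with its 'after' text."""
    before, after = hunk_to_before_after(hunk_lines)
    if before not in content:
        raise ValueError("hunk's 'before' text not found in file")
    return content.replace(before, after, 1)
```

Because the edit is located by matching the surrounding text rather than by line numbers, the hunk header's numbers never need to be correct.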
With unified diffs, GPT acts more like it's writing textual data intended to be read by a program,
not talking to a person.
A unified diff looks pretty much like the code it is modifying.

The one complicated piece is the line numbers found at the start
of each hunk, which look something like this: `@@ -2,4 +3,5 @@`.
This example header is from a hunk that would replace 4 lines starting
at line 2 of the original file with 5 lines that would start at line 3
of the updated file.
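Those header numbers decode as a start line and a line count for the original and updated files. A small parsing sketch (illustrative, not aider's code):

```python
import re

# `@@ -<old_start>,<old_count> +<new_start>,<new_count> @@`
HUNK_HEADER = re.compile(r"^@@ -(\d+),(\d+) \+(\d+),(\d+) @@")

def parse_hunk_header(header):
    """Return (old_start, old_count, new_start, new_count) as ints."""
    match = HUNK_HEADER.match(header)
    if not match:
        raise ValueError(f"not a hunk header: {header!r}")
    return tuple(int(group) for group in match.groups())
```

For example, `parse_hunk_header("@@ -2,4 +3,5 @@")` returns `(2, 4, 3, 5)`. (A count of 1 can be omitted entirely in real diffs, e.g. `@@ -5 +5 @@`; the sketch ignores that case.)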
You've probably read a lot of unified diffs without ever
caring about these line numbers,
def main(args):

Simple, right?

### Encourage high level edits

The example unified diffs we've seen so far have all been single line changes,
which makes them pretty easy to read and understand.
Consider this slightly more complex change, which renames the variable `n` to
`number`:

```diff
@@ ... @@
-def factorial(n):
+def factorial(number):
change is not as succinct as the minimal diff above,
but it is much easier to see two different coherent versions of the
`factorial()` function.

```diff
@@ ... @@
-def factorial(n):
-    "compute factorial"
applied as edits to the source files.

These imperfect diffs exhibit a variety of problems:

- GPT forgets to include some semantically irrelevant lines or details. Often GPT forgets things like comments, docstrings, blank lines, etc. Or it skips over some code that it doesn't intend to change.
- GPT forgets the leading *plus* `+` character to mark novel lines that it wants to add to the file. It incorrectly includes them with a leading *space* ` ` as if they were already in the file.
- GPT jumps ahead to show edits to a different part of the file without starting a new hunk with a `@@ ... @@` divider.

As an example of the first issue, consider this source code:
Any naive attempt to use actual unified diffs
or any other strict diff format
is certainly doomed,
but the techniques described here and
incorporated into aider provide
a highly effective solution.

There could be significant benefits to
fine tuning models on
the simpler, high level style of diffs that are described here.
Dropping line numbers from the hunk headers and focusing on diffs of
semantically coherent chunks of code
seems to be an important part of successful GPT code editing.

Most LLMs will have already seen plenty of unified diffs