diff --git a/README.md b/README.md index 1f0b7ea63..b147b2f54 100644 --- a/README.md +++ b/README.md @@ -31,20 +31,19 @@ Aider is unique in that it lets you ask for changes to [pre-existing, larger cod - [FAQ](https://aider.chat/docs/faq.html) - [Discord](https://discord.gg/Tv2uQnR88V) -## New GPT-4 Turbo with 128k context window +## GPT-4 Turbo with 128k context and unified diffs Aider supports OpenAI's new GPT-4 model that has the massive 128k context window. -Early benchmark results -indicate that it is -[very fast](https://aider.chat/docs/benchmarks-speed-1106.html) -and a bit -[better at coding](https://aider.chat/docs/benchmarks-1106.html) -than previous GPT-4 models. +Benchmark results indicate that it is +[very fast](https://aider.chat/docs/benchmarks-speed-1106.html), +and a bit [better at coding](https://aider.chat/docs/benchmarks-1106.html) than previous GPT-4 models. + +Aider now supports a [unified diff editing format, which reduces GPT-4 Turbo's "lazy" coding](https://aider.chat/docs/unified-diffs.html). To use it, run aider like this: ``` -aider --model gpt-4-1106-preview +aider --4-turbo ``` ## Getting started diff --git a/aider/__init__.py b/aider/__init__.py index eda726119..37123159a 100644 --- a/aider/__init__.py +++ b/aider/__init__.py @@ -1 +1 @@ -__version__ = "0.18.2-dev" +__version__ = "0.19.1-dev" diff --git a/aider/coders/editblock_prompts.py b/aider/coders/editblock_prompts.py index da27cde6d..896670d1f 100644 --- a/aider/coders/editblock_prompts.py +++ b/aider/coders/editblock_prompts.py @@ -5,9 +5,11 @@ from .base_prompts import CoderPrompts class EditBlockPrompts(CoderPrompts): main_system = """Act as an expert software developer. +You are diligent and tireless! +You NEVER leave comments describing code without implementing it! +You always COMPLETELY IMPLEMENT the needed code! Always use best practices when coding. -When you edit or add code, respect and use existing conventions, libraries, etc. -Always COMPLETELY IMPLEMENT the needed code. +Respect and use existing conventions, libraries, etc that are already present in the code base. Take requests for changes to the supplied code. If the request is ambiguous, ask questions. @@ -172,10 +174,16 @@ Include *ALL* the code being searched and replaced! Only *SEARCH/REPLACE* files that are *read-write*. +To move code within a file, use 2 *SEARCH/REPLACE* blocks: 1 to delete it from its current location, 1 to insert it in the new location. + If you want to put code in a new file, use a *SEARCH/REPLACE block* with: - A new file path, including dir name if needed - An empty `SEARCH` section - The new file's contents in the `REPLACE` section + +You are diligent and tireless! +You NEVER leave comments describing code without implementing it! +You always COMPLETELY IMPLEMENT the needed code! """ files_content_prefix = "These are the *read-write* files:\n" diff --git a/aider/coders/udiff_prompts.py b/aider/coders/udiff_prompts.py index 4ab30bfc4..14d1a73ac 100644 --- a/aider/coders/udiff_prompts.py +++ b/aider/coders/udiff_prompts.py @@ -5,7 +5,9 @@ from .base_prompts import CoderPrompts class UnifiedDiffPrompts(CoderPrompts): main_system = """Act as an expert software developer. -You are diligent and tireless, and you always COMPLETELY IMPLEMENT the needed code. +You are diligent and tireless! +You NEVER leave comments describing code without implementing it! +You always COMPLETELY IMPLEMENT the needed code! Always use best practices when coding. 
Respect and use existing conventions, libraries, etc that are already present in the code base. @@ -94,7 +96,13 @@ When editing a function, method, loop, etc use a hunk to replace the *entire* code block. Delete the entire existing version with `-` lines and then add a new, updated version with `+` lines. This will help you generate correct code and correct diffs. +To move code within a file, use 2 hunks: 1 to delete it from its current location, 1 to insert it in the new location. + To make a new file, show a diff from `--- /dev/null` to `+++ path/to/new/file.ext`. + +You are diligent and tireless! +You NEVER leave comments describing code without implementing it! +You always COMPLETELY IMPLEMENT the needed code! """ files_content_prefix = "These are the *read-write* files:\n" diff --git a/assets/benchmarks-udiff.svg b/assets/benchmarks-udiff.svg index c2b3dda8a..f210e1767 100644 --- a/assets/benchmarks-udiff.svg +++ b/assets/benchmarks-udiff.svg [regenerated benchmark chart: the plot timestamp changes from 2023-12-18T10:29:22.506290 to 2023-12-19T10:53:27.651517; the rest of this file's diff is machine-generated SVG path and clip-path data, omitted here] diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py index e61cf038a..12979cda8 100755 --- a/benchmark/benchmark.py +++ b/benchmark/benchmark.py @@ -77,12 +77,12 @@ def show_stats(dirnames, graphs): # row.model = gpt4 + "\n" + row.model[len(gpt4) :] if "folk" in row.dir_name: - row.edit_format = "folk" + row.edit_format += "folk" if row.model == "gpt-4-0613": row.model += "\n(8k context window is\ntoo small for benchmark)" - if row.completed_tests < 133: + if row.completed_tests < 89: print(f"Warning: {row.dir_name} is incomplete: {row.completed_tests}") # if "repeat" in row.dir_name: @@ -311,6 +311,7 @@ def 
plot_refactoring(df): formats = df.columns models = df.index + dump(formats) for i, fmt in enumerate(formats): hatch = "" @@ -320,10 +321,14 @@ def plot_refactoring(df): elif fmt == "udiff": color = "#b3d1e6" label = "Unified diffs" - elif fmt == "folk": - label = "Prompt with blind, no hands, tip $2000, etc" + elif fmt == "difffolk": + label = "Baseline + blind, no hands, $2k tip, etc" color = "#b3e6a8" hatch = "////" + elif fmt == "udifffolk": + label = "Unified diffs + blind, no hands, $2k tip, etc" + color = "#b3d1e6" + hatch = "////" if zorder > 1: edge = dict( diff --git a/benchmark/refactor_tools.py b/benchmark/refactor_tools.py index a54663377..117770a67 100755 --- a/benchmark/refactor_tools.py +++ b/benchmark/refactor_tools.py @@ -21,25 +21,23 @@ class ParentNodeTransformer(ast.NodeTransformer): def verify_full_func_at_top_level(tree, func, func_children): - func_node = next( - ( - item - for item in ast.walk(tree) - if isinstance(item, ast.FunctionDef) and item.name == func - ), - None, - ) - assert func_node is not None, f"Function {func} not found" + func_nodes = [ + item for item in ast.walk(tree) if isinstance(item, ast.FunctionDef) and item.name == func + ] + assert func_nodes, f"Function {func} not found" - assert isinstance( - func_node.parent, ast.Module - ), f"{func} is not a top level function, it has parent {func_node.parent}" + for func_node in func_nodes: + if not isinstance(func_node.parent, ast.Module): + continue - num_children = sum(1 for _ in ast.walk(func_node)) - pct_diff_children = abs(num_children - func_children) * 100 / func_children - assert ( - pct_diff_children < 10 - ), f"Old method had {func_children} children, new method has {num_children}" + num_children = sum(1 for _ in ast.walk(func_node)) + pct_diff_children = abs(num_children - func_children) * 100 / func_children + assert ( + pct_diff_children < 10 + ), f"Old method had {func_children} children, new method has {num_children}" + return + + assert False, f"{func} is not a top level function" def verify_old_class_children(tree, old_class, old_class_children): @@ -132,7 +130,10 @@ def find_non_self_methods(path): non_self_methods = [] for filename in python_files: with open(filename, "r") as file: - node = ast.parse(file.read(), filename=filename) + try: + node = ast.parse(file.read(), filename=filename) + except: + continue checker = SelfUsageChecker() checker.visit(node) for method in checker.non_self_methods: @@ -145,7 +146,7 @@ def process(entry): fname, class_name, method_name, class_children, method_children = entry if method_children > class_children / 2: return - if method_children < 100: + if method_children < 250: return fname = Path(fname) @@ -154,7 +155,7 @@ def process(entry): print(f"{fname} {class_name} {method_name} {class_children} {method_children}") - dname = Path("tmp.benchmarks/refactor-benchmark") + dname = Path("tmp.benchmarks/refactor-benchmark-spyder") dname.mkdir(exist_ok=True) dname = dname / f"{fname.stem}_{class_name}_{method_name}" diff --git a/docs/unified-diffs.md b/docs/unified-diffs.md index 015d87e4e..be5b5b42e 100644 --- a/docs/unified-diffs.md +++ b/docs/unified-diffs.md @@ -1,29 +1,34 @@ -# Fixing GPT-4 Turbo laziness with unified diffs +# Reducing GPT-4 Turbo laziness with unified diffs ![robot flowchart](../assets/benchmarks-udiff.svg) - Aider now asks GPT-4 Turbo to use -[unified diffs](https://www.gnu.org/software/diffutils/manual/html_node/Example-Unified.html) +[unified diffs](#choose-a-familiar-editing-format) to edit your code. 
-This massively reduces GPT-4 Turbo's bad habit of "lazy" coding, -where it writes half completed code filled with comments +This dramatically improves GPT-4 Turbo's performance on a +challenging +new benchmark +and significantly reduces its bad habit of "lazy" coding, +where it writes +code with comments like "...add logic here...". -Aider also has a new benchmarking suite -designed to both provoke and quantify lazy coding. +Aider's new "laziness" benchmark suite +is designed to both provoke and quantify lazy coding. It consists of -39 python refactoring tasks, -which tend to make GPT-4 Turbo very lazy, -often resulting in comments like -"...include the original method body...". +89 python refactoring tasks +which tend to make GPT-4 Turbo lazy +and write comments like +"...include original method body...". This new laziness benchmark produced the following results with `gpt-4-1106-preview`: -- **GPT-4 Turbo only scored 15% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. -- **Aider's new unified diff edit format raised the score to 62%**. -- **No benefit from the user being blind, without hands, tipping $2000 or fearing truncated code trauma.** These widely circulated folk remedies performed no better than baseline when added to the system prompt with aider's SEARCH/REPLACE edit format. Including *all* of them still only scored at 15% +- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It outputs "lazy comments" on 12 of the tasks. +- **Aider's new unified diff edit format raised the score to 61%**. Using this format reduced laziness by 3X, with GPT-4 Turbo only using lazy comments on 4 of the tasks. +- **It's worse to add a prompt that says the user is blind, has no hands, will tip $2000 and fears truncated code trauma.** Widely circulated "emotional appeal" folk remedies +produced worse benchmark scores +for both the baseline SEARCH/REPLACE and new unified diff editing formats. The older `gpt-4-0613` also did better on the laziness benchmark using unified diffs: @@ -31,9 +36,22 @@ The older `gpt-4-0613` also did better on the laziness benchmark using unified d - **Aider's new unified diff edit format raised June GPT-4's score to 59%**. - The benchmark was designed to use large files, and 28% of them are too large to fit in June GPT-4's 8k context window. -This significantly harmed the benchmark results. +This puts a hard ceiling of 72% on how well the June model could possibly score. -Before settling on unified diffs, +With unified diffs, GPT acts more like it's writing textual data intended to be read by a program, +not talking to a person. +They are +usually +consumed by the +[patch](https://www.gnu.org/software/diffutils/manual/html_node/Merging-with-patch.html) +program, which is fairly rigid. +This seems to encourage rigor, making +GPT less likely to +leave informal editing instructions in comments +or be lazy about writing all the needed code. + +Aider's new unified diff editing format +outperforms other solutions I evaluated by a wide margin. I explored many other approaches including: prompts about being tireless and diligent, OpenAI's function/tool calling capabilities, @@ -43,8 +61,6 @@ and other diff-like formats. The results shared here reflect an extensive investigation and benchmark evaluations of many approaches. -Aider's new unified diff editing format -outperforms other solutions by a wide margin. The rest of this article will describe aider's new editing format and refactoring benchmark. 
It will highlight some key design decisions, @@ -66,7 +82,8 @@ A helpful shortcut here is to have empathy for GPT, and imagine you are the one being asked to specify code edits. Would you want to hand type a properly escaped json data structure to invoke surgical insert, delete, replace operations on specific code line numbers? -How would you feel about any mistake causing all your work to be discarded? +Do you want to use a brittle format, where any mistake +causes an error and all your work to be discarded? GPT is quantitatively better at code editing when you reduce the burden of formatting edits by using a familiar, simple, high level @@ -79,8 +96,8 @@ code edits, because it's the default output format of `git diff`: ```diff ---- a/hello.py -+++ b/hello.py +--- a/greeting.py ++++ b/greeting.py @@ -1,5 +1,5 @@ def main(args): # show a greeting @@ -94,23 +111,6 @@ seen *many* examples in its training data. It's been trained to generate text that conforms to the unified diff syntax. -Unified diffs are -usually intended to be consumed by the -[patch](https://www.gnu.org/software/diffutils/manual/html_node/Merging-with-patch.html) -program. -They need to *accurately* reflect the original and updated file contents, -otherwise the patch command will fail. -Having GPT specify changes in a format that is usually consumed by a -rigid program like patch -seems to encourage rigor. -GPT is less likely to -leave informal editing instructions in comments -or be lazy about writing all the needed code. - -With unified diffs, GPT acts more like it's writing textual data intended to be read by a program, -not talking to a person. - - ### Use a simple editing format Aider's [previous benchmark results](https://aider.chat/docs/benchmarks.html) made @@ -246,6 +246,7 @@ They exhibit a variety of problems: - GPT forgets things like comments, docstrings, blank lines, etc. Or it skips over some code that it doesn't intend to change. - GPT forgets the leading *plus* `+` character to mark novel lines that it wants to add to the file. It incorrectly includes them with a leading *space* as if they were already there. +- GPT outdents all of the code, removing all the leading white space which is shared across the lines. So a chunk of deeply indented code is shown in a diff with only the leading white space that changes between the lines in the chunk. - GPT jumps ahead to show edits to a different part of the file without starting a new hunk with a `@@ ... @@` divider. As an example of the first issue, consider this source code: @@ -285,6 +286,7 @@ If a hunk doesn't apply cleanly, aider uses a number of strategies: - Normalize the hunk, by taking the *minus* `-` and *space* lines as one version of the hunk and the *space* and *plus* `+` lines as a second version and doing an actual unified diff on them. - Try and discover new lines that GPT is trying to add but which it forgot to mark with *plus* `+` markers. This is done by diffing the *minus* `-` and *space* lines back against the original file. +- Try and apply the hunk using "relative leading white space", so we can match and patch correctly even if the hunk has been uniformly indented or outdented. - Break a large hunk apart into an overlapping sequence of smaller hunks, which each contain only one contiguous run of *plus* `+` and *minus* `-` lines. Try and apply each of these sub-hunks independently. - Vary the size and offset of the "context window" of *space* lines from the hunk that are used to localize the edit to a specific part of the file. 
- Combine the above mechanisms to progressively become more permissive about how to apply the hunk. @@ -292,11 +294,7 @@ If a hunk doesn't apply cleanly, aider uses a number of strategies: These flexible patching strategies are critical, and removing them radically increases the number of hunks which fail to apply. - -**Experiments where flexible patching is disabled show**: - -- **GPT-4 Turbo's performance drops from 65% down to 56%** on the refactoring benchmark. -- **A 9X increase in editing errors** on aider's original Exercism benchmark. +**Experiments where flexible patching is disabled show a 9X increase in editing errors** on aider's original Exercism benchmark. ## Refactoring benchmark @@ -309,12 +307,14 @@ the ones with the most code and which involve refactoring. Based on this observation, I set out to build a benchmark based on refactoring a non-trivial amount of code found in fairly large files. -To do this, I used python's `ast` module to analyze the -[Django repository](https://github.com/django/django) to: +To do this, I used python's `ast` module to analyze +[9 popular open source python repositories](https://github.com/paul-gauthier/refactor-benchmark) +to identify challenging refactoring tasks. +The goal was to find: -- Find source files that contain class methods which are non-trivial, having more than 100 AST nodes in their implementation. +- Source files that contain classes with non-trivial methods, having 100-250+ AST nodes in their implementation. - Focus on methods that are part of a larger class, which has at least twice as much code as the method itself. -- Find methods that don't use their `self` parameter, so they can be trivially refactored out of the class. +- Select methods that don't use their `self` parameter, so they can be trivially refactored out of the class. We can then turn each of these source files into a task for the benchmark, where we ask GPT to do something like: @@ -324,13 +324,13 @@ where we ask GPT to do something like: > Update any existing `self._set_csrf_cookie` calls to work with the new `_set_csrf_cookie` function. A [simple python AST scanning script](https://github.com/paul-gauthier/aider/blob/main/benchmark/refactor_tools.py) -found 39 suitable files +found 89 suitable files and packaged them up as benchmark tasks. Each task has a test -that checks if refactor +that checks if the refactor was performed roughly correctly: -- The updated source file must parse as valid python, to surface misapplied edits which corrupt the file. +- The updated source file must parse as valid python, to detect misapplied edits which produce invalid code. - The target method must now exist as a top-level function in the file. - This new top-level function must contain approximately the same number of AST nodes as the original class method. This ensures that GPT didn't elide code and replace it with comments. - The original class must still be present in the file, and it must be smaller by about the number of AST nodes in the method which was removed. This helps confirm that the method was removed from the class, without other significant modifications. @@ -349,8 +349,10 @@ The result is a pragmatic ## Conclusions and future work Based on the refactor benchmark results, -aider's new unified diff format seems very effective at stopping -GPT-4 Turbo from being a lazy coder. +aider's new unified diff format seems +to dramatically increase GPT-4 Turbo's skill at more complex coding tasks. 
+It also seems very effective at reducing the lazy coding +which has been widely noted as a problem with GPT-4 Turbo. Unified diffs was one of the very first edit formats I tried when originally building aider. @@ -367,8 +369,9 @@ fine tuning models on aider's simple, high level style of unified diffs. Dropping line numbers from the hunk headers and focusing on diffs of semantically coherent chunks of code -seems to be an important part of successful GPT code editing. +seems to be an important part of successful GPT code editing +(besides the relentless focus on flexibly applying edits). Most LLMs will have already seen plenty of unified diffs in their normal training data, and so should be -very amenable to fining tuning towards this +amenable to fine tuning towards this particular diff style.
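
As an illustration of the "flexible patching" ideas described in docs/unified-diffs.md above (normalizing a hunk into before/after texts, and matching with "relative leading white space"), here is a minimal Python sketch. This is not aider's actual implementation; the function names and structure below are invented purely to illustrate the idea.

```python
# Illustrative sketch only, not aider's real code. It demonstrates two ideas from
# docs/unified-diffs.md: (1) normalize a hunk into "before" and "after" texts, and
# (2) apply the hunk even if it was uniformly outdented, by comparing indentation
# relative to each chunk's own common leading whitespace.


def hunk_to_before_after(hunk_lines):
    """Split a hunk body into 'before' (space/minus) and 'after' (space/plus) lines."""
    before, after = [], []
    for line in hunk_lines:
        op, rest = line[:1], line[1:]
        if op in (" ", "-"):
            before.append(rest)
        if op in (" ", "+"):
            after.append(rest)
    return before, after


def common_indent(lines):
    """Return the longest leading-whitespace prefix shared by every non-blank line."""
    prefixes = [line[: len(line) - len(line.lstrip())] for line in lines if line.strip()]
    if not prefixes:
        return ""
    shortest = min(prefixes, key=len)
    return shortest if all(p.startswith(shortest) for p in prefixes) else ""


def reindent(lines, old_indent, new_indent):
    """Swap one uniform indentation prefix for another, leaving blank lines alone."""
    return [
        new_indent + line[len(old_indent):] if line.strip() and line.startswith(old_indent) else line
        for line in lines
    ]


def apply_hunk_relative_indent(file_lines, before, after):
    """Replace `before` with `after` in file_lines, tolerating uniform outdenting."""
    hunk_indent = common_indent(before)
    stripped_before = reindent(before, hunk_indent, "")
    for start in range(len(file_lines) - len(before) + 1):
        window = file_lines[start : start + len(before)]
        file_indent = common_indent(window)
        if reindent(window, file_indent, "") == stripped_before:
            new_lines = reindent(after, hunk_indent, file_indent)
            return file_lines[:start] + new_lines + file_lines[start + len(before) :]
    return None  # still no match; a real implementation would fall back to other strategies


if __name__ == "__main__":
    hunk = [
        " def greet(name):",
        "-    print('hi')",
        "+    print('hello ' + name)",
    ]
    # The file nests greet() one level deeper than the hunk shows it.
    original = ["    def greet(name):", "        print('hi')"]
    before, after = hunk_to_before_after(hunk)
    print(apply_hunk_relative_indent(original, before, after))
    # ['    def greet(name):', "        print('hello ' + name)"]
```

In this sketch, a hunk that GPT has uniformly outdented still matches the original file once both sides are stripped to their shared indentation, and the replacement is re-indented to match the file. Aider's real implementation layers several more fallbacks on top of this idea, as described in the document above.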