2025-06-01 02:05:00 +00:00 · 2023-12-17 16:45:11 -08:00 · 2023-12-17 16:45:11 -08:00 · 616aca8656
commit 616aca8656
parent e27d5d26b7
1 changed files with 20 additions and 19 deletions
--- a/docs/unified-diffs.md
+++ b/docs/unified-diffs.md
@@ -12,22 +12,22 @@ This new support for unified diffs massively reduces GPT-4 Turbo's bad habit of
 There are abundant anecdotes
 about GPT-4 Turbo writing half completed code filled with comments that give
 homework assignments to the user
-like "...omitted for brevity..." or "...add logic here...".
+like
 "...add logic here..."
 or
 "...omitted for brevity...".
 Aider's new unified diff edit format significantly reduces this sort of lazy coding,
 as quantified by dramatically improved scores
-on a new "laziness benchmark".
+on a new "laziness" benchmark suite.
-Before trying to reduce laziness, I needed a way to quantify and measure
+Aider's new benchmarking suite is
-the problem.
+designed to both provoke and quantify lazy coding.
 I developed a new
 benchmarking suite designed to both provoke and quantify lazy coding.
 It consists of 39 python refactoring tasks,
 which ask GPT to remove a non-trivial method from a class and make it
 a stand alone function.
 GPT-4 Turbo is prone to being lazy on this sort of task, because it's mostly a
 "cut & paste" of code from one place in a file to another.
-GPT often creates the new function with a body that is empty except for
+Rather than writing out the code, GPT often just leaves
 a comment like
 "...include the body of the original method..."
@@ -35,7 +35,7 @@ This new laziness benchmark produced the following results with `gpt-4-1106-prev
 - **GPT-4 Turbo only scored 15% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. This confirms the anecdotes that GPT-4 Turbo is quite lazy when coding, and serves as a baseline for comparison.
 - **Aider's new unified diff edit format raised the score to 65%**.
- **A system prompt based on widely circulated folklore only scored 15%, same as the baseline.** This experiment used the existing "SEARCH/REPLACE block" format with an additional prompt that claims the user is blind, has no hands, will tip $2000 and has suffered from "truncated code trauma".
+- **A system prompt based on widely circulated folklore performed same as the baseline.** This experiment used the existing "SEARCH/REPLACE block" format with an additional prompt that claims the user is blind, has no hands, will tip $2000 and has suffered from "truncated code trauma". This prompt scored only 15% on the refactor benchmark.
 The older `gpt-4-0613` also did better on the laziness benchmark by using unified diffs.
 The benchmark was designed to work with large source code files, and
@@ -423,22 +423,23 @@ The result is a pragmatic
 ## Conclusions and future work
-Aider's new unified diff format seems very effective at stopping
+Based on the refactor benchmark results,
 aider's new unified diff format seems very effective at stopping
 GPT-4 Turbo from being a lazy coder.
-I suspect that anyone who has tried to have GPT edit code
+Unified diffs were one of the very first edit formats I tried
-started out asking for diffs of some kind.
+when first building aider.
-I know I did.
+I think a lot of other AI coding assistant projects have also
-Any naive attempt to use actual unified diffs
+tried going down this path.
-or any other strict diff format
+It seems that any naive or direct use of structure diff formats
-is certainly doomed,
+is pretty much doomed to failure.
-but the techniques described here and
+But the techniques described here and
 incorporated into aider provide
-a highly effective solution.
+a highly effective way to harness GPT's knowledge of unified diffs.
 There could be significant benefits to
 fine tuning models on
-the simpler, high level style of diffs that are described here.
+aider's simple, high level style of unified diffs.
 Dropping line numbers from the hunk headers and focusing on diffs of
 semantically coherent chunks of code
 seems to be an important part of successful GPT code editing.
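
As a rough illustration of the "dropping line numbers from the hunk headers" idea in the closing paragraph, here is a minimal Python sketch. It is not aider's implementation: the function name `strip_hunk_line_numbers` and the bare `@@ ... @@` placeholder are assumptions made for this example. It only rewrites standard numbered hunk headers and leaves every other line of the diff untouched.

```python
import re

# Matches a standard unified diff hunk header, e.g. "@@ -12,22 +12,22 @@ context".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@.*$")


def strip_hunk_line_numbers(diff_text: str) -> str:
    """Replace each numbered hunk header with a bare marker (hypothetical helper)."""
    out = []
    for line in diff_text.splitlines():
        if HUNK_HEADER.match(line):
            # The placeholder text is an assumption chosen for this sketch.
            out.append("@@ ... @@")
        else:
            out.append(line)
    return "\n".join(out)


example = """--- a/demo.py
+++ b/demo.py
@@ -1,3 +1,3 @@
 def greet():
-    print("hello")
+    print("hello, world")
"""

print(strip_hunk_line_numbers(example))
```

With the line numbers gone, a hunk can only be located by matching its context and removed lines against the file, which is one plausible reason that hunks covering a semantically coherent chunk of code are easier to apply reliably.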