From 837fd9e30bd89d883fd8fe524de45bc8e584f04c Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Tue, 19 Dec 2023 14:59:55 -0800
Subject: [PATCH] copy

---
 docs/unified-diffs.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/docs/unified-diffs.md b/docs/unified-diffs.md
index 70774a471..1e0a6646b 100644
--- a/docs/unified-diffs.md
+++ b/docs/unified-diffs.md
@@ -7,7 +7,9 @@
 Aider now asks GPT-4 Turbo to use
 [unified diffs](#choose-a-familiar-editing-format)
 to edit your code.
-This dramatically improves GPT-4 Turbo's performance on a complex benchmark
+This dramatically improves GPT-4 Turbo's performance on a
+challenging
+new benchmark
 and significantly reduces its bad habit of "lazy" coding,
 where it writes
 code with comments
@@ -17,15 +19,15 @@ Aider also has a new "laziness" benchmark suite
 designed to both provoke and quantify lazy coding.
 It consists of
 89 python refactoring tasks
-which tend to make GPT-4 Turbo very lazy.
-On these tasks it often produces comments like
+which tend to make GPT-4 Turbo lazy
+and write comments like
 "...include the original method body...".
 
 This new laziness benchmark produced the following results with `gpt-4-1106-preview`:
 
-- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It output "lazy comments" on 12 of the tasks.
+- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It outputs "lazy comments" on 12 of the tasks.
 - **Aider's new unified diff edit format raised the score to 61%**. Using this format reduced laziness by 3X, with GPT-4 Turbo only using lazy comments on 4 of the tasks.
-- **It's worse to prompt that the user is blind, without hands, will tip $2000 and fears truncated code trauma.** These widely circulated folk remedies performed worse on the benchmark when added to the system prompt for the baseline SEARCH/REPLACE and new unified diff editing formats. These prompts did *slightly* reduce the amount of laziness, but at a large cost to successful benchmark outcomes.
+- **It's worse to prompt that the user is blind, without hands, will tip $2000 and fears truncated code trauma.** These widely circulated folk remedies performed worse on the benchmark when added to the system prompt for the baseline SEARCH/REPLACE and new unified diff editing formats. These prompts did slightly reduce the amount of laziness against baseline (to 8 lazy tasks). It increased the lazy tasks to 5 when added to the unified diff prompt.
 
 The older `gpt-4-0613` also did better on the laziness benchmark using unified diffs:
 