Paul Gauthier 2023-12-19 14:59:55 -08:00
parent 3e639639d5
commit 837fd9e30b

@@ -7,7 +7,9 @@
 Aider now asks GPT-4 Turbo to use
 [unified diffs](#choose-a-familiar-editing-format)
 to edit your code.
-This dramatically improves GPT-4 Turbo's performance on a complex benchmark
+This dramatically improves GPT-4 Turbo's performance on a
+challenging
+new benchmark
 and significantly reduces its bad habit of "lazy" coding,
 where it writes
 code with comments
@@ -17,15 +19,15 @@ Aider also has a new "laziness" benchmark suite
 designed to both provoke and quantify lazy coding.
 It consists of
 89 python refactoring tasks
-which tend to make GPT-4 Turbo very lazy.
-On these tasks it often produces comments like
+which tend to make GPT-4 Turbo lazy
+and write comments like
 "...include the original method body...".
 This new laziness benchmark produced the following results with `gpt-4-1106-preview`:
-- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It output "lazy comments" on 12 of the tasks.
+- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It outputs "lazy comments" on 12 of the tasks.
 - **Aider's new unified diff edit format raised the score to 61%**. Using this format reduced laziness by 3X, with GPT-4 Turbo only using lazy comments on 4 of the tasks.
-- **It's worse to prompt that the user is blind, without hands, will tip $2000 and fears truncated code trauma.** These widely circulated folk remedies performed worse on the benchmark when added to the system prompt for the baseline SEARCH/REPLACE and new unified diff editing formats. These prompts did *slightly* reduce the amount of laziness, but at a large cost to successful benchmark outcomes.
+- **It's worse to prompt that the user is blind, without hands, will tip $2000 and fears truncated code trauma.** These widely circulated folk remedies performed worse on the benchmark when added to the system prompt for the baseline SEARCH/REPLACE and new unified diff editing formats. These prompts did slightly reduce the amount of laziness against baseline (to 8 lazy tasks). It increased the lazy tasks to 5 when added to the unified diff prompt.
 The older `gpt-4-0613` also did better on the laziness benchmark using unified diffs: