Paul Gauthier 2024-04-09 16:57:58 -07:00
parent 00f1cdb561
commit eea35a3414

@@ -1,7 +1,7 @@
 ---
 title: GPT-4 Turbo with Vision is a step backwards for coding
-excerpt: OpenAI's new `gpt-4-turbo-2024-04-09` model scores worse on aider's code editing benchmarks than all the previous GPT-4 models.
-highlight_image: /assets/2024-03-07-claude-3.svg
+excerpt: OpenAI's new gpt-4-turbo-2024-04-09 model scores worse on aider's code editing benchmarks than all the previous GPT-4 models.
+highlight_image: /assets/2024-04-09-gpt-4-turbo-laziness.svg
 ---
 # GPT-4 Turbo with Vision is a step backwards for coding
@@ -26,9 +26,9 @@ For each exercise, the LLM gets two tries to solve each problem:
 1. On the first try, it gets initial stub code and the English description of the coding task. If the tests all pass, we are done.
 2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
-GPT-4 Turbo with Vision
+**GPT-4 Turbo with Vision
 scores only 62% on this benchmark,
-the lowest score of any of the existing GPT-4 models.
+the lowest score of any of the existing GPT-4 models.**
 The other models scored 63-66%, so this represents only a small
 regression, and is likely statistically insignificant when compared
 against `gpt-4-0613`.
@@ -53,9 +53,9 @@ It consists of
 89 python refactoring tasks
 which tend to make GPT-4 Turbo code in that lazy manner.
-The new GPT-4 Turbo with Vision model scores only 33% on aider's
+**The new GPT-4 Turbo with Vision model scores only 33% on aider's
 refactoring benchmark, making it the laziest coder of all the GPT-4 Turbo models
-by a significant margin.
+by a significant margin.**
 # Conclusions