Added gpt-4-turbo vision blog post
commit 00f1cdb561 (parent b117c1580c), 4 changed files with 3343 additions and 31 deletions

_posts/2024-04-09-gpt-4-turbo.md (new file, 69 lines added)

---
title: GPT-4 Turbo with Vision is a step backwards for coding
excerpt: OpenAI's new `gpt-4-turbo-2024-04-09` model scores worse on aider's code editing benchmarks than all the previous GPT-4 models.
highlight_image: /assets/2024-03-07-claude-3.svg
---

# GPT-4 Turbo with Vision is a step backwards for coding

[OpenAI just released GPT-4 Turbo with Vision](https://twitter.com/OpenAIDevs/status/1777769463258988634)
and it performs worse on aider's benchmark suites than all the previous GPT-4 models.
In particular, it seems much more prone to "lazy coding" than the
GPT-4 Turbo preview models.

## Code editing skill

![Code editing benchmark results](https://aider.chat/assets/2024-04-09-gpt-4-turbo.svg)

Aider relies on a
[code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark)
to quantitatively evaluate how well
an LLM can make changes to existing code.
The benchmark uses aider to try and complete
[133 Exercism Python coding exercises](https://github.com/exercism/python).

For each exercise, the LLM gets two tries to solve the problem, as sketched in the loop below:

1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.

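In outline, one exercise flows through the harness something like this. This is only an illustrative sketch of the two-try loop, not aider's actual benchmarking code; `ask_llm` and `run_tests` are hypothetical stand-ins for the real LLM call and test runner:

```python
def benchmark_exercise(stub_code, task_description, ask_llm, run_tests):
    """Run one exercise through the two-try flow described above.

    ask_llm(code, task, errors=None) -> revised code (hypothetical LLM call)
    run_tests(code) -> (passed, test_output)        (hypothetical test runner)
    """
    # First try: the LLM gets the stub code and the task description.
    code = ask_llm(stub_code, task_description)
    passed, test_output = run_tests(code)
    if passed:
        return "passed on first try"

    # Second try: the failing test output goes back to the LLM.
    code = ask_llm(code, task_description, errors=test_output)
    passed, _ = run_tests(code)
    return "passed on second try" if passed else "failed"
```
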
GPT-4 Turbo with Vision
scores only 62% on this benchmark,
the lowest score of any of the existing GPT-4 models.
The other models scored 63-66%, so this represents only a small
regression, and is likely statistically insignificant when compared
against `gpt-4-0613`.

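As a rough sanity check on that claim: if each of the 133 exercises is treated as an independent pass/fail trial (a simplifying assumption, not part of aider's tooling), the standard error on the pass rate is about 4 percentage points, larger than the 1-4 point gap between the models:

```python
import math

n = 133  # Exercism exercises in the benchmark
p = 0.62  # GPT-4 Turbo with Vision's pass rate

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
se = math.sqrt(p * (1 - p) / n)
print(f"standard error: {se:.1%}")  # ~4.2%
```
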
## Lazy coding

![Laziness benchmark results](https://aider.chat/assets/2024-04-09-gpt-4-turbo-laziness.svg)

The GPT-4 Turbo "preview" models have been widely criticized for being "lazy"
when coding.
They often omit needed code
and instead leave comments with homework assignments like "implement method here".

```python
def some_complex_method(foo, bar):
    # ... implement method here ...
```

Aider uses a ["laziness" benchmark suite](https://github.com/paul-gauthier/refactor-benchmark)
which is designed to both provoke and quantify lazy coding.
It consists of
89 Python refactoring tasks
which tend to make GPT-4 Turbo code in that lazy manner.

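To get a feel for how lazy output can be flagged mechanically, here is a minimal sketch that scans an edited file for the telltale placeholder comments shown earlier. This is only an illustration, not the actual checks the benchmark suite performs; the patterns and the `looks_lazy` helper are hypothetical:

```python
import re

# Patterns that typically signal elided code rather than a real edit.
# An illustrative list, not the benchmark's actual heuristics.
LAZY_PATTERNS = [
    r"#\s*\.\.\.",                    # "# ..." placeholder comments
    r"implement (the )?method here",  # "implement method here"
    r"rest of (the )?code",           # "... rest of code ..."
]

def looks_lazy(source: str) -> bool:
    """Return True if the edited source contains placeholder comments."""
    return any(re.search(p, source, re.IGNORECASE) for p in LAZY_PATTERNS)

print(looks_lazy("def f(foo, bar):\n    # ... implement method here ...\n"))  # True
```
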
The new GPT-4 Turbo with Vision model scores only 33% on aider's
refactoring benchmark, making it the laziest coder of all the GPT-4 Turbo models
by a significant margin.

## Conclusions

Aider has full support for the new GPT-4 Turbo with Vision
model, which you can access using the switch `--model gpt-4-turbo-2024-04-09`.
But aider will continue to use `gpt-4-1106-preview` by default,
as it is by far the strongest coder of the GPT-4 models.
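
To try the new model anyway, pass the switch shown above when launching aider:

```
aider --model gpt-4-turbo-2024-04-09
```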