mirror of https://github.com/Aider-AI/aider.git, synced 2025-06-14 16:44:59 +00:00

Commit 56519361e2 (parent 5a4d38418d): move into website/
103 changed files with 5 additions and 12 deletions
---
title: GPT-4 Turbo with Vision is a step backwards for coding
excerpt: OpenAI's GPT-4 Turbo with Vision model scores worse on aider's code editing benchmarks than all the previous GPT-4 models. In particular, it seems much more prone to "lazy coding" than the existing GPT-4 Turbo "preview" models.
highlight_image: /assets/2024-04-09-gpt-4-turbo-laziness.jpg
nav_exclude: true
---
# GPT-4 Turbo with Vision is a step backwards for coding

[OpenAI just released GPT-4 Turbo with Vision](https://twitter.com/OpenAIDevs/status/1777769463258988634)
and it performs worse on aider's coding benchmarks than all the previous GPT-4 models.
In particular, it seems much more prone to "lazy coding" than the
existing GPT-4 Turbo "preview" models.

## Code editing skill

[![benchmark results](https://aider.chat/assets/2024-04-09-gpt-4-turbo.svg)](https://aider.chat/assets/2024-04-09-gpt-4-turbo.svg)

Aider relies on a
[code editing benchmark](https://aider.chat/docs/benchmarks.html#the-benchmark)
to quantitatively evaluate how well
an LLM can make changes to existing code.
The benchmark uses aider to try and complete
[133 Exercism Python coding exercises](https://github.com/exercism/python).

For each exercise, the LLM gets two tries to solve the problem:

1. On the first try, it gets initial stub code and the English description of the coding task. If the tests all pass, we are done.
2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
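
The two-try protocol above can be sketched as a small Python driver. The `ask_llm` and `run_tests` helpers here are hypothetical stand-ins, not aider's actual benchmark harness:

```python
def benchmark_exercise(task_description, stub_code, ask_llm, run_tests):
    """Give the LLM up to two tries to solve one exercise.

    Returns (solved, tries_used).
    """
    # First try: the model sees only the stub code and the task description.
    solution = ask_llm(task_description, stub_code)
    failures = run_tests(solution)
    if not failures:
        return True, 1  # solved on the first try

    # Second try: the model also sees the failing test output.
    solution = ask_llm(task_description, solution, test_output=failures)
    failures = run_tests(solution)
    return not failures, 2
```

The pass rate reported below is then just the fraction of exercises for which this loop ends with `solved == True`.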

**GPT-4 Turbo with Vision
scores only 62% on this benchmark,
the lowest score of any of the existing GPT-4 models.**
The other models scored 63-66%, so this represents only a small
regression, and is likely statistically insignificant when compared
against `gpt-4-0613`.
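
As a rough sanity check on that claim (my own back-of-the-envelope arithmetic, not a figure from the benchmark), the sampling noise on a pass rate measured over 133 exercises is about four percentage points:

```python
import math

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n).
n = 133   # Exercism exercises in the benchmark
p = 0.62  # GPT-4 Turbo with Vision's pass rate
se = math.sqrt(p * (1 - p) / n)
print(f"standard error: {se * 100:.1f} percentage points")  # about 4.2
```

With roughly ±4 points of noise in each measurement, a spread of 62% versus 63-66% is hard to distinguish from chance.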

## Lazy coding

[![benchmark results](https://aider.chat/assets/2024-04-09-gpt-4-turbo-laziness.svg)](https://aider.chat/assets/2024-04-09-gpt-4-turbo-laziness.svg)

The GPT-4 Turbo "preview" models have been widely criticized for being "lazy"
when coding.
They often omit needed code
and instead leave comments with homework assignments like "implement method here".

```
def some_complex_method(foo, bar):
    # ... implement method here ...
```

Aider uses a ["laziness" benchmark suite](https://github.com/paul-gauthier/refactor-benchmark)
which is designed to both provoke and quantify lazy coding.
It consists of
89 Python refactoring tasks
which tend to make GPT-4 Turbo code in that lazy manner.
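
One toy way to see how such placeholder responses could be flagged automatically is a simple pattern match over the model's output. This is my own sketch under that assumption; the refactor benchmark's real scoring works differently:

```python
import re

# Placeholder phrases that typically signal an elided implementation.
LAZY_PATTERNS = [
    r"#\s*\.\.\.",          # "# ..." elision comments
    r"implement .*here",    # "implement method here"
    r"your code here",
    r"rest of the code",
]

def looks_lazy(code: str) -> bool:
    """Return True if the code contains a known lazy-coding placeholder."""
    return any(re.search(pat, code, re.IGNORECASE) for pat in LAZY_PATTERNS)
```

A check like this only catches the comment-shaped laziness shown above; quantifying omitted code more generally requires actually running the refactored program's tests.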

**The new GPT-4 Turbo with Vision model scores only 34% on aider's
refactoring benchmark, making it the laziest coder of all the GPT-4 Turbo models
by a significant margin.**

## Conclusions

Aider has full support for the new GPT-4 Turbo with Vision
model, which you can access using the switch `--model gpt-4-turbo-2024-04-09`.
But aider will continue to use `gpt-4-1106-preview` by default,
as it is by far the strongest coder of the GPT-4 models.