
---
title: The January GPT-4 Turbo is lazier than the last version
excerpt: The new `gpt-4-0125-preview` model is quantitatively lazier at coding than previous GPT-4 versions, according to a new "laziness" benchmark.
highlight_image: /assets/benchmarks-0125.svg
---

# The January GPT-4 Turbo is lazier than the last version

![benchmark results](/assets/benchmarks-0125.svg)

OpenAI just released a new version of GPT-4 Turbo. This new model is intended to reduce the "laziness" that has been widely observed with the previous `gpt-4-1106-preview` model:

> Today, we are releasing an updated GPT-4 Turbo preview model, `gpt-4-0125-preview`. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of "laziness" where the model doesn't complete a task.

With that in mind, I've been benchmarking the new model using aider's existing lazy coding benchmark.
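
For readers who want to reproduce these runs, the benchmark harness in the aider repo can be driven from the command line. The sketch below is illustrative: the run name is arbitrary, and the exact flags (`--model`, `--edit-format`, `--threads`) are assumed from the harness's README rather than specified by this report.

```bash
# A minimal sketch of invoking aider's benchmark harness from the repo root.
# The run name is arbitrary; the flag names are assumptions based on the
# benchmark README, not something this report documents.
./benchmark/benchmark.py 0125-preview-udiff \
    --model gpt-4-0125-preview \
    --edit-format udiff \
    --threads 10
```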

## Benchmark results

Overall, the new `gpt-4-0125-preview` model seems lazier than the November `gpt-4-1106-preview` model:

- It gets worse benchmark scores when using the unified diffs code editing format.
- Using aider's older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But it still scores worse than either model using unified diffs. (Both formats are sketched below.)
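
To make the comparison concrete, here is roughly what the two editing formats look like. These snippets are illustrative reconstructions (the file name and code are invented for the example), not output from the benchmark runs. A unified diff edit looks like a standard diff, with `@@ ... @@` hunk headers in place of line numbers:

```diff
--- hello.py
+++ hello.py
@@ ... @@
-def greet():
-    print("goodbye")
+def greet():
+    print("hello")
```

The same change as a SEARCH/REPLACE block, where the model quotes the existing code verbatim and supplies the replacement:

```
hello.py
<<<<<<< SEARCH
def greet():
    print("goodbye")
=======
def greet():
    print("hello")
>>>>>>> REPLACE
```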

This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI's GPT models. You can review the other reports for additional information: