# Lazy coding benchmark for gpt-4-0125-preview
OpenAI just released a new version of GPT-4 Turbo.
This new model is intended to reduce the "laziness" that has been widely observed with the previous gpt-4-1106-preview model:

> Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.
With that in mind, I've been benchmarking the new model using aider's existing lazy coding benchmark.
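For context, a run of aider's benchmark harness looks roughly like the sketch below. The script path, run name, and flag values here are illustrative assumptions rather than the verified invocation; the aider repo documents the real command.

```bash
# Illustrative sketch only -- the script path and flags are assumptions,
# not the verified benchmark harness invocation.
./benchmark/benchmark.py 0125-preview-lazy-run \
    --model gpt-4-0125-preview \
    --edit-format udiff \
    --threads 10
```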
## Benchmark results
These results are currently preliminary, and will be updated as additional benchmark runs complete.
Overall, the new gpt-4-0125-preview model does worse on the lazy coding benchmark than the November gpt-4-1106-preview model:
- It performs much worse when using the unified diffs code editing format.
- Using aider's older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But even then it still performs worse than either model does when using unified diffs (examples of both formats are sketched after this list).
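To make those bullet points concrete, here is roughly what the two editing formats look like. These are simplified illustrations: the file name and code are invented for the example, and the real formats carry some additional conventions.

A unified diff edit:

```diff
--- hello.py
+++ hello.py
@@ ... @@
-def greet():
-    print("TODO")
+def greet():
+    print("Hello, world!")
```

The equivalent SEARCH/REPLACE block edit:

```
hello.py
<<<<<<< SEARCH
def greet():
    print("TODO")
=======
def greet():
    print("Hello, world!")
>>>>>>> REPLACE
```

In a lazy response, the model typically swaps part of the required code for a placeholder comment like `# ... rest of the code unchanged ...` instead of writing out the full edit, which is the behavior the benchmark is designed to catch.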
## Related reports
This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI's GPT models. You can review the other reports for additional information:
- GPT code editing benchmarks evaluates the March and June versions of GPT-3.5 and GPT-4.
- Code editing benchmarks for OpenAI's "1106" models.
- Aider's lazy coding benchmark.