# Lazy coding benchmark for gpt-4-0125-preview
OpenAI just released a new version of GPT-4 Turbo.
This new model is intended to reduce the "laziness" that has been widely observed with the previous gpt-4-1106-preview model:

> Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.
With that in mind, I've been benchmarking the new model using aider's existing lazy coding benchmark.
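For context, a run of aider's benchmark harness looks roughly like the sketch below. The script path, run name, and flag values here are illustrative assumptions rather than the verified invocation; the aider repo documents the real command.

```bash
# Illustrative sketch only -- the script path and flags are assumptions,
# not the verified benchmark harness invocation.
./benchmark/benchmark.py 0125-preview-lazy-run \
    --model gpt-4-0125-preview \
    --edit-format udiff \
    --threads 10
```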
## Benchmark results
These results are currently preliminary, and will be updated as additional benchmark runs complete.
Overall, the new gpt-4-0125-preview model does worse on the lazy coding benchmark than the November gpt-4-1106-preview model:
- It performs much worse when using the unified diffs code editing format.
- Using aider's older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But even then it still performs worse than either model does when using unified diffs (examples of both formats are sketched after this list).
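To make those bullet points concrete, here is roughly what the two editing formats look like. These are simplified illustrations: the file name and code are invented for the example, and the real formats carry some additional conventions.

A unified diff edit:

```diff
--- hello.py
+++ hello.py
@@ ... @@
-def greet():
-    print("TODO")
+def greet():
+    print("Hello, world!")
```

The equivalent SEARCH/REPLACE block edit:

```
hello.py
<<<<<<< SEARCH
def greet():
    print("TODO")
=======
def greet():
    print("Hello, world!")
>>>>>>> REPLACE
```

In a lazy response, the model typically swaps part of the required code for a placeholder comment like `# ... rest of the code unchanged ...` instead of writing out the full edit, which is the behavior the benchmark is designed to catch.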
## Related reports
This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI's GPT models. You can review the other reports for additional information:
- GPT code editing benchmarks evaluates the March and June versions of GPT-3.5 and GPT-4.
- Code editing benchmarks for OpenAI's "1106" models.
- Aider's lazy coding benchmark.