
Code editing benchmarks for OpenAI's "0125" model

[Chart: benchmark results]

OpenAI just released a new version of GPT-4 Turbo. This new model is intended to reduce the "lazy coding" that has been widely observed with the previous gpt-4-1106-preview model:

Today, we are releasing an updated GPT-4 Turbo preview model, gpt-4-0125-preview. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn't complete a task.

With that in mind, I've been benchmarking the new model using aider's existing lazy coding benchmark.
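
For context, the lazy coding benchmark is run with the benchmark harness that lives in the aider repo. The sketch below shows the general shape of such a run; the run name, model, edit format, and thread count are illustrative assumptions, and the exact flags should be checked against the harness itself.

```bash
# Illustrative sketch of invoking aider's benchmark harness (benchmark/benchmark.py)
# against the new model. The values here are assumptions for illustration,
# not the exact runs behind the numbers reported in this post.
./benchmark/benchmark.py 0125-lazy-coding-run \
    --model gpt-4-0125-preview \
    --edit-format udiff \
    --threads 10
```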

Benchmark results

These results are currently preliminary, and will be updated as additional benchmark runs complete.

The new gpt-4-0125-preview model produces mixed results on the lazy coding benchmark as compared to the November gpt-4-1106-preview model:

  • It performs much worse when using the unified diffs code editing format.
  • Using aider's older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But it still performs worse than both models using unified diffs. (Illustrative sketches of both formats follow this list.)
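
To make the two formats concrete, here is how a model might express the same small fix in each. The file name and code are invented for this example; the exact markers and prompting rules aider uses are defined in the aider repo.

A unified diff edit, in the simplified form aider uses (hunk headers without line numbers):

```diff
--- a/calculator.py
+++ b/calculator.py
@@ ... @@
-def add(a, b):
-    return a - b
+def add(a, b):
+    return a + b
```

The same edit as a SEARCH/REPLACE block:

```
calculator.py
<<<<<<< SEARCH
def add(a, b):
    return a - b
=======
def add(a, b):
    return a + b
>>>>>>> REPLACE
```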

This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI's GPT models. You can review the other reports for additional information.

Updates

Last updated 11/14/23. OpenAI has relaxed rate limits so these results are no longer considered preliminary.