
---
title: The January GPT-4 Turbo is lazier than the last version
excerpt: The new `gpt-4-0125-preview` model is quantitatively lazier at coding than previous GPT-4 versions, according to a new "laziness" benchmark.
highlight_image: /assets/benchmarks-0125.svg
---

# The January GPT-4 Turbo is lazier than the last version

![benchmark results](/assets/benchmarks-0125.svg)

OpenAI just released a new version of GPT-4 Turbo. This new model is intended to reduce the "laziness" that has been widely observed with the previous `gpt-4-1106-preview` model:

> Today, we are releasing an updated GPT-4 Turbo preview model, `gpt-4-0125-preview`. This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of "laziness" where the model doesn't complete a task.

With that in mind, I've been benchmarking the new model using aider's existing lazy coding benchmark.
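
For readers who want to reproduce these runs, the benchmark harness in the aider repo can be driven from the command line. The sketch below is illustrative: the run name is arbitrary, and the exact flags (`--model`, `--edit-format`, `--threads`) are assumed from the harness's README rather than specified by this report.

```bash
# A minimal sketch of invoking aider's benchmark harness from the repo root.
# The run name is arbitrary; the flag names are assumptions based on the
# benchmark README, not something this report documents.
./benchmark/benchmark.py 0125-preview-udiff \
    --model gpt-4-0125-preview \
    --edit-format udiff \
    --threads 10
```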

## Benchmark results

Overall, the new `gpt-4-0125-preview` model seems lazier than the November `gpt-4-1106-preview` model:

- It gets worse benchmark scores when using the unified diffs code editing format.
- Using aider's older SEARCH/REPLACE block editing format, the new January model outperforms the older November model. But it still scores worse than either model using unified diffs. (Both formats are sketched below.)
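
To make the comparison concrete, here is roughly what the two editing formats look like. These snippets are illustrative reconstructions (the file name and code are invented for the example), not output from the benchmark runs. A unified diff edit looks like a standard diff, with `@@ ... @@` hunk headers in place of line numbers:

```diff
--- hello.py
+++ hello.py
@@ ... @@
-def greet():
-    print("goodbye")
+def greet():
+    print("hello")
```

The same change as a SEARCH/REPLACE block, where the model quotes the existing code verbatim and supplies the replacement:

```
hello.py
<<<<<<< SEARCH
def greet():
    print("goodbye")
=======
def greet():
    print("hello")
>>>>>>> REPLACE
```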

This is one in a series of reports that use the aider benchmarking suite to assess and compare the code editing capabilities of OpenAI's GPT models. You can review the other reports for additional information: