diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md
index 747eaa0cd..119059015 100644
--- a/aider/website/_posts/2024-08-14-code-in-json.md
+++ b/aider/website/_posts/2024-08-14-code-in-json.md
@@ -116,14 +116,13 @@ valid json, or even enforcing that it meets a specific schema.
 
 For example, OpenAI recently announced
 [strict enforcement of json responses]().
 
-The problem is that LLMs are bad a writing code when you ask them to wrap it
-into a json container.
-The json tooling around the LLM helps make sure it's valid json,
-which does solve an important problem.
-LLMs used to frequently produce invalid json, so that's a big step forward.
-
-The problem remains, LLMs write worse code when they're asked to
+But it's not sufficient to just produce
+valid json; it also
+has to contain quality code.
+Unfortunately,
+LLMs write worse code when they're asked to
 emit it wrapped in json.
+
 In some sense this shouldn't be surprising.
 Just look at the very simple json example above, with the escaped
@@ -140,12 +139,17 @@ typing it into a text file or hand typing it as a
 properly escaped json string?
 
 Previous [benchmark results](/2023/07/02/benchmarks.html) showed
-the superiority of plain text coding compared to json-wrapped function calls,
-but they were done over a year ago.
+the superiority of returning code
+as plain text compared to json-wrapped function calls.
+But those results were obtained
+over a year ago, against far less
+capable models.
 OpenAI's newly announced support for "strict" json seemed like a good reason to
 investigate whether the newest models are still
 handicapped by json-wrapping code.
-To find out, I benchmarked 3 of the strongest code editing models:
+The graph above shows benchmark
+results from
+3 of the strongest code editing models:
 
 - gpt-4o-2024-08-06
 - claude-3-5-sonnet-20240620
@@ -155,15 +159,18 @@ Each model was given one try to solve
 [133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
 This is the standard aider "code editing" benchmark, except restricted to a single attempt.
 
-Each model ran through the benchmark with two strategies for returning code:
+Each model was assessed by the benchmark with two
+different strategies for returning code:
 
-- **Markdown** -- where the model simply returns the whole source code file in standard markdown triple-backtick fences.
+- **Markdown** -- where the model simply returned the whole source code file in standard markdown triple-backtick fences.
 - **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json.
 
-The markdown strategy would return a program like this:
+The markdown strategy is the same as
+aider's "whole" edit format.
+It asks the LLM to return a program like this:
 
 ````
-Here is the program you asked for which prints "Hello, world!":
+Here is the program you asked for which prints "Hello":
 
 greeting.py
 ```
@@ -177,18 +184,21 @@ two parameters, like this:
 
 ```
 {
-    "explanation": "Here is the program you asked for which prints \"Hello, world!\"",
+    "explanation": "Here is the program you asked for which prints \"Hello\"",
     "content": "def greeting():\n    print(\"Hello\")\n"
 }
 ```
 
-Both of these formats avoid actually *editing* source files, to keep things as
+Both of these formats avoid actually *editing* source files, to keep
+the task as
 simple as possible.
-This makes the task much easier, since the LLM can emit the whole source file intact.
-LLMs find it much more challenging to correctly formulate instructions to edit
+The LLM can emit the whole source file intact,
+which is much easier
+than correctly formulating
+instructions to edit
 portions of a file.
-We are simply testing the effects of json-wrapping on the LLMs ability to solve coding tasks.
+We are simply testing the effects of json-wrapping on the LLM's ability to write code to solve a task.
 
 ## Results