diff --git a/aider/website/_includes/code-in-json-syntax.js b/aider/website/_includes/code-in-json-syntax.js
index b315edea9..5c0e652b1 100644
--- a/aider/website/_includes/code-in-json-syntax.js
+++ b/aider/website/_includes/code-in-json-syntax.js
@@ -55,7 +55,7 @@ document.addEventListener('DOMContentLoaded', function () {
           beginAtZero: true,
           title: {
             display: true,
-            text: 'Total syntactic errors from 5 runs'
+            text: 'Total syntax errors from 5 runs'
           },
           max: 35
         }
@@ -63,7 +63,7 @@ document.addEventListener('DOMContentLoaded', function () {
       plugins: {
         title: {
           display: true,
-          text: 'Syntactic errors by model and code wrapping strategy',
+          text: 'Syntax errors by model and code wrapping strategy',
          font: {
            size: 16
          }
diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md
index 6546e1dfa..59cc444f4 100644
--- a/aider/website/_posts/2024-08-14-code-in-json.md
+++ b/aider/website/_posts/2024-08-14-code-in-json.md
@@ -12,8 +12,6 @@ nav_exclude: true
 
 # LLMs are bad at returning code in JSON
 
-## Abstract
-
 Current LLMs have support for returning properly formatted
 JSON, making it easier for clients to reliably parse complex responses.
 It therefore seems attractive for
@@ -23,8 +21,7 @@ LLMs write worse code when asked to wrap it in JSON,
 harming their ability to correctly solve coding tasks.
 On a variant of the aider code editing benchmark,
 asking for JSON-wrapped code
-often significantly harms coding
-performance.
+often harms coding performance.
 This holds true across many top coding LLMs,
 including OpenAI's latest model gpt-4o-2024-08-06
 which has strong JSON support.
@@ -36,7 +33,7 @@ which has strong JSON support.
 > Models produce better code when they return it
 as plain markdown text,
 as compared to wrapping it in JSON for a tool function call.
 
-## Introduction
+## Background
 
 A lot of people wonder why aider doesn't use LLM tools for code editing.
 Instead, aider asks for code edits in plain text, like this:
@@ -66,14 +63,17 @@ which would return a structured JSON response:
 ```
 
 This has become even more tempting as LLM providers
-continue to improve their tooling for reliably generating
-valid JSON.
-For example, OpenAI recently announced the ability to
-[strictly enforce that JSON responses will be syntactically correct
-and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/).
+continue to improve their tooling for reliably generating JSON.
+For example,
+[OpenAI recently announced](https://openai.com/index/introducing-structured-outputs-in-the-api/)
+the ability to
+strictly enforce that JSON responses will be syntactically correct
+and conform to a specified schema.
+
 But producing valid (schema compliant) JSON is not sufficient
 for working with AI generated code.
-The code inside the JSON has to be valid and high quality too.
+The code inside the JSON has to correctly solve the requested task
+and be free from syntax errors.
 Unfortunately,
 LLMs write worse code when they're asked to wrap it in JSON.
@@ -108,29 +108,23 @@ capable models.
 OpenAI's newly announced support for "strict" JSON seemed like
 a good reason to investigate
 whether the newest models are still handicapped by JSON-wrapping code.
 
-Four of the strongest code editing models were benchmarked
-to assess the impact of JSON-wrapping code:
+The results presented here were based on
+the
+[aider "code editing" benchmark](/2023/07/02/benchmarks.html#the-benchmark)
+of 133 practice exercises from the Exercism python repository.
+Models were
+restricted to a single attempt,
+without the second try to fix errors that the aider benchmark normally allows.
 
-- claude-3-5-sonnet-20240620
-- deepseek-coder (V2 0724)
-- gpt-4o-2024-05-13
-- gpt-4o-2024-08-06
-
-Each model was given one try to solve
-[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
-This is the standard aider "code editing" benchmark, but restricted to a single attempt
-without a second try to "fix" any errors.
-
-The benchmark assessed the models coding ability
-using different strategies for returning code:
+The performance of each model was compared across different strategies for returning code:
 
 - **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences.
-- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON.
+- **JSON** -- the model used a tool function call to return the whole source code file. This required the LLM to wrap the code in JSON.
 - **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting.
 
 The markdown strategy is the same as aider's "whole" edit format, where the
-LLM would return a source file like this:
+LLM returns a source file like this:
 
 ````
 Here is the program you asked for which prints "Hello":
@@ -163,13 +157,20 @@ than correctly formulating instructions to edit portions of a file.
 
-This experimental setup is designed to highlight
+This experimental setup is designed to quantify
 the effects of JSON-wrapping on the LLMs
 ability to write code
 to solve a task.
 
 ## Results
 
-Each of the 4 models was benchmarked 5 times using the different
-strategies for returning code.
+Four of the strongest code editing models were benchmarked
+to assess the impact of JSON-wrapping code:
+
+- claude-3-5-sonnet-20240620
+- deepseek-coder (V2 0724)
+- gpt-4o-2024-05-13
+- gpt-4o-2024-08-06
+
+Each combination of model and code wrapping strategy was benchmarked 5 times.
 
 ## Overall coding skill
 
@@ -191,22 +192,24 @@ Of course, both JSON results were well below the markdown result.
 
 ## Syntax errors
 
-Figure 2 shows the number of syntactic errors found in the code produced by each
-model and code wrapping strategy.
-Models tend to make more syntactic errors when asked to wrap code in JSON.
+Models tend to make more syntax errors when asked to wrap code in JSON.
+Figure 2 shows the number of syntax errors found in the code produced by each
+model and code wrapping strategy,
+totaling `SyntaxError` and `IndentationError` errors from all 5 runs.
 
-Sonnet avoided syntactic errors regardless of the code wrapping strategy,
-but its benchmark scores in Figure 1 were lower with JSON.
-This seems to indicate that JSON-wrapping
-does more than simply raise the syntactic difficulty in coding.
-It may distract or challenge the model in a way that
-reduces its ability to reason about coding problems.
+
+Sonnet's results seem to indicate that the negative effects of JSON-wrapping
+go beyond syntactic difficulties.
+Sonnet avoided syntax errors regardless of the code wrapping strategy,
+but its benchmark scores in Figure 1 were nonetheless lower with JSON.
+This suggests that JSON-wrapping may distract or challenge models in a way that
+reduces their ability to reason about solving coding problems.
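+
+As a rough illustration of the counting in Figure 2, here is a minimal
+sketch (not necessarily the benchmark's actual implementation) that
+tallies these errors by compiling each generated file:
+
+```python
+import pathlib
+
+
+def count_syntax_errors(paths):
+    """Count generated files that fail to compile.
+
+    IndentationError is a subclass of SyntaxError,
+    so a single except clause catches both.
+    """
+    errors = 0
+    for path in paths:
+        source = pathlib.Path(path).read_text()
+        try:
+            # Parse the file without executing it.
+            compile(source, path, "exec")
+        except SyntaxError:
+            errors += 1
+    return errors
+```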
 {% include code-in-json-syntax.js %}
 
 > Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code,
 > totaled from 5 runs.
-> Models tend to make more syntactic errors when asked to wrap code in JSON.
+> Models tend to make more syntax and formatting errors when asked to wrap code in JSON.
 
 ## Conclusions
@@ -217,7 +220,7 @@ the conclusion seems unchanged:
 LLMs are bad at returning code in JSON.
 
 OpenAI appears to be making progress in allowing LLMs to return code in
 structured JSON responses without harming the code quality.
-But it seems premature to consider switching from plain text
+But it still seems premature to consider switching from plain text
 to JSON-wrapped code.
@@ -227,4 +230,4 @@ The results presented here are not directly comparable to results
 from the main
 [aider LLM leaderboard](https://aider.chat/docs/leaderboards/).
 A number of settings were changed to simplify the benchmark
-in order to focus on comparing plain text and JSON wrapped code.
+in order to focus on comparing plain text and JSON-wrapped code.
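+
+## Appendix: the "JSON (strict)" strategy
+
+For reference, here is a rough sketch of what the "JSON (strict)" strategy
+looks like at the API level, using OpenAI's structured outputs for function
+calling. The tool name and schema below are illustrative only, not aider's
+actual tool definition:
+
+```python
+from openai import OpenAI
+
+client = OpenAI()
+
+# A hypothetical tool that returns a whole source file wrapped in JSON.
+# strict=True makes the API guarantee that the arguments are valid JSON
+# conforming to this schema -- it does not guarantee the code is correct.
+tools = [{
+    "type": "function",
+    "function": {
+        "name": "write_file",
+        "strict": True,
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "explanation": {"type": "string"},
+                "content": {"type": "string"},
+            },
+            "required": ["explanation", "content"],
+            "additionalProperties": False,
+        },
+    },
+}]
+
+response = client.chat.completions.create(
+    model="gpt-4o-2024-08-06",
+    messages=[{"role": "user", "content": 'Write a program that prints "Hello"'}],
+    tools=tools,
+)
+```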