Paul Gauthier 2024-08-15 12:14:49 -07:00
parent 479f73871b
commit 3e5dba8d5c
2 changed files with 47 additions and 44 deletions

View file

@@ -55,7 +55,7 @@ document.addEventListener('DOMContentLoaded', function () {
beginAtZero: true,
title: {
display: true,
-text: 'Total syntactic errors from 5 runs'
+text: 'Total syntax errors from 5 runs'
},
max: 35
}
@@ -63,7 +63,7 @@ document.addEventListener('DOMContentLoaded', function () {
plugins: {
title: {
display: true,
-text: 'Syntactic errors by model and code wrapping strategy',
+text: 'Syntax errors by model and code wrapping strategy',
font: {
size: 16
}

View file

@@ -12,8 +12,6 @@ nav_exclude: true
# LLMs are bad at returning code in JSON
## Abstract
Current LLMs have support for returning properly formatted JSON,
making it easier for clients to reliably parse complex responses.
It therefore seems attractive for
@@ -23,8 +21,7 @@ LLMs write worse code when asked to wrap it in JSON, harming their ability
to correctly solve coding tasks.
On a variant of the aider code editing benchmark,
asking for JSON-wrapped code
-often significantly harms coding
-performance.
+often harms coding performance.
This holds true across many top coding LLMs,
including OpenAI's latest model gpt-4o-2024-08-06
which has strong JSON support.
@@ -36,7 +33,7 @@ which has strong JSON support.
> Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call.
-## Introduction
+## Background
A lot of people wonder why aider doesn't use LLM tools for code editing.
Instead, aider asks for code edits in plain text, like this:
@@ -66,14 +63,17 @@ which would return a structured JSON response:
```
This has become even more tempting as LLM providers
-continue to improve their tooling for reliably generating
-valid JSON.
-For example, OpenAI recently announced the ability to
-[strictly enforce that JSON responses will be syntactically correct
-and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/).
+continue to improve their tooling for reliably generating JSON.
+For example,
+[OpenAI recently announced](https://openai.com/index/introducing-structured-outputs-in-the-api/)
+the ability to
+strictly enforce that JSON responses will be syntactically correct
+and conform to a specified schema.
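As a rough sketch of what this looks like in practice (the `save_file` tool name and its one-field schema below are illustrative assumptions, not aider's actual tool definition), a strict function call with the OpenAI Python SDK forces the model to return the whole file as a JSON string:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: ask the model to return a complete source file as JSON.
tools = [
    {
        "type": "function",
        "function": {
            "name": "save_file",  # illustrative name, not aider's actual tool
            "description": "Return the complete source code that solves the task.",
            "strict": True,  # enforce schema-conformant JSON
            "parameters": {
                "type": "object",
                "properties": {
                    "source_code": {"type": "string"},
                },
                "required": ["source_code"],
                "additionalProperties": False,  # required when strict=True
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": 'Write a program that prints "Hello".'}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "save_file"}},
)

# The code arrives as an escaped string inside the tool call's JSON arguments.
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args["source_code"])
```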
But producing valid (schema compliant) JSON is not sufficient for working with AI generated code.
-The code inside the JSON has to be valid and high quality too.
+The code inside the JSON has to correctly solve the requested task
+and be free from syntax errors.
Unfortunately,
LLMs write worse code when they're asked to
wrap it in JSON.
@@ -108,29 +108,23 @@ capable models.
OpenAI's newly announced support for "strict" JSON seemed like a good reason to
investigate whether the newest models are still handicapped by JSON-wrapping code.
-Four of the strongest code editing models were benchmarked
-to assess the impact of JSON-wrapping code:
-The results presented here were based on
-the
-[aider "code editing" benchmark](/2023/07/02/benchmarks.html#the-benchmark)
-of 133 practice exercises from the Exercism python repository.
-Models were
-restricted to a single attempt,
-without a second try to fix errors as is normal in the aider benchmark.
-- claude-3-5-sonnet-20240620
-- deepseek-coder (V2 0724)
-- gpt-4o-2024-05-13
-- gpt-4o-2024-08-06
+Each model was given one try to solve
+[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
+This is the standard aider "code editing" benchmark, but restricted to a single attempt
+without a second try to "fix" any errors.
-The benchmark assessed the models coding ability
-using different strategies for returning code:
+The performance of each model was compared across different strategies for returning code:
- **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences.
-- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON.
+- **JSON** -- the model used a tool function call to return the whole source code file. This required the LLM to wrap the code in JSON.
- **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting.
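To make the JSON strategies concrete: wrapping even a trivial Python file in a JSON string means every newline and quote must be escaped (the one-field payload below is a simplified assumption, not the exact schema used in the benchmark):

```python
import json

# A trivial source file, held as an ordinary Python string.
source = 'def greeting():\n    print("Hello")\n'

# JSON-wrapping the file forces every newline and quote to be escaped.
print(json.dumps({"source_code": source}))
# -> {"source_code": "def greeting():\n    print(\"Hello\")\n"}
```

None of this escaping is needed when the code is returned inside markdown fences, as in the example below.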
The markdown strategy is the same as
aider's "whole" edit format, where the
-LLM would return a source file like this:
+LLM returns a source file like this:
````
Here is the program you asked for which prints "Hello":
@@ -163,13 +157,20 @@ than correctly formulating
instructions to edit
portions of a file.
-This experimental setup is designed to highlight
+This experimental setup is designed to quantify
the effects of JSON-wrapping on the LLMs' ability to write code to solve a task.
## Results
-Each of the 4 models was benchmarked 5 times using the different
-strategies for returning code.
+Four of the strongest code editing models were benchmarked
+to assess the impact of JSON-wrapping code:
+- claude-3-5-sonnet-20240620
+- deepseek-coder (V2 0724)
+- gpt-4o-2024-05-13
+- gpt-4o-2024-08-06
+Each combination of model and code wrapping strategy was benchmarked 5 times.
## Overall coding skill
@@ -191,22 +192,24 @@ Of course, both JSON results were well below the markdown result.
## Syntax errors
-Figure 2 shows the number of syntactic errors found in the code produced by each
-model and code wrapping strategy.
-Models tend to make more syntactic errors when asked to wrap code in JSON.
+Models tend to make more syntax errors when asked to wrap code in JSON.
+Figure 2 shows the number of syntax errors found in the code produced by each
+model and code wrapping strategy,
+totaling up `SyntaxError` and `IndentationError` errors from all 5 runs.
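One way such a tally can be computed (a minimal sketch, assuming each generated file is simply compiled; this is not necessarily aider's actual benchmark harness) is with Python's built-in `compile()`:

```python
from pathlib import Path

def count_syntax_errors(paths):
    """Count generated files that fail to compile.

    IndentationError is a subclass of SyntaxError, so one except clause
    catches both kinds of errors tallied in Figure 2.
    """
    errors = 0
    for path in paths:
        try:
            compile(Path(path).read_text(), str(path), "exec")
        except SyntaxError:
            errors += 1
    return errors
```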
-Sonnet avoided syntactic errors regardless of the code wrapping strategy,
-but its benchmark scores in Figure 1 were lower with JSON.
-This seems to indicate that JSON-wrapping
-does more than simply raise the syntactic difficulty in coding.
-It may distract or challenge the model in a way that
-reduces its ability to reason about coding problems.
+Sonnet's results seem to indicate that the negative effects of JSON-wrapping
+go beyond syntactic difficulties.
+Sonnet avoided syntax errors regardless of the code wrapping strategy,
+but its benchmark scores in Figure 1 were nonetheless lower with JSON.
+This implies that JSON-wrapping may distract or challenge models in a way that
+reduces their ability to reason about solving coding problems.
{% include code-in-json-syntax.js %}
> Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code,
> totaled from 5 runs.
-> Models tend to make more syntactic errors when asked to wrap code in JSON.
+> Models tend to make more syntax and formatting errors when asked to wrap code in JSON.
## Conclusions
@@ -217,7 +220,7 @@ the conclusion seems unchanged: LLMs are bad at returning code in JSON.
OpenAI appears to be making progress in allowing LLMs to return code in
structured JSON responses without harming the code quality.
-But it seems premature to consider switching from plain text
+But it still seems premature to consider switching from plain text
to JSON-wrapped code.
@@ -227,4 +230,4 @@ The results presented here are not directly comparable to results
from the main
[aider LLM leaderboard](https://aider.chat/docs/leaderboards/).
A number of settings were changed to simplify the benchmark
-in order to focus on comparing plain text and JSON wrapped code.
+in order to focus on comparing plain text and JSON-wrapped code.