Paul Gauthier 2024-08-15 12:14:49 -07:00
parent 479f73871b
commit 3e5dba8d5c
2 changed files with 47 additions and 44 deletions

Changed file 1 of 2

@@ -55,7 +55,7 @@ document.addEventListener('DOMContentLoaded', function () {
           beginAtZero: true,
           title: {
             display: true,
-            text: 'Total syntactic errors from 5 runs'
+            text: 'Total syntax errors from 5 runs'
           },
           max: 35
         }
@@ -63,7 +63,7 @@ document.addEventListener('DOMContentLoaded', function () {
       plugins: {
         title: {
           display: true,
-          text: 'Syntactic errors by model and code wrapping strategy',
+          text: 'Syntax errors by model and code wrapping strategy',
           font: {
             size: 16
           }

Changed file 2 of 2

@@ -12,8 +12,6 @@ nav_exclude: true
 # LLMs are bad at returning code in JSON
 
-## Abstract
-
 Current LLMs have support for returning properly formatted JSON,
 making it easier for clients to reliably parse complex responses.
 It therefore seems attractive for
@@ -23,8 +21,7 @@ LLMs write worse code when asked to wrap it in JSON, harming their ability
 to correctly solve coding tasks.
 On a variant of the aider code editing benchmark,
 asking for JSON-wrapped code
-often significantly harms coding
-performance.
+often harms coding performance.
 This holds true across many top coding LLMs,
 including OpenAI's latest model gpt-4o-2024-08-06
 which has strong JSON support.
@@ -36,7 +33,7 @@ which has strong JSON support.
 > Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call.
 
-## Introduction
+## Background
 
 A lot of people wonder why aider doesn't use LLM tools for code editing.
 Instead, aider asks for code edits in plain text, like this:
@@ -66,14 +63,17 @@ which would return a structured JSON response:
 ```
 
 This has become even more tempting as LLM providers
-continue to improve their tooling for reliably generating
-valid JSON.
-For example, OpenAI recently announced the ability to
-[strictly enforce that JSON responses will be syntactically correct
-and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/).
+continue to improve their tooling for reliably generating JSON.
+For example,
+[OpenAI recently announced](https://openai.com/index/introducing-structured-outputs-in-the-api/)
+the ability to
+strictly enforce that JSON responses will be syntactically correct
+and conform to a specified schema.
 
 But producing valid (schema compliant) JSON is not sufficient for working with AI generated code.
-The code inside the JSON has to be valid and high quality too.
+The code inside the JSON has to correctly solve the requested task
+and be free from syntax errors.
 Unfortunately,
 LLMs write worse code when they're asked to
 wrap it in JSON.
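To make concrete what "JSON-wrapped code" means in the post, here is a minimal sketch of asking a model to return a whole source file through an OpenAI tool call with strict schema enforcement. The `write_file` tool name and its schema are illustrative assumptions, not aider's actual benchmark tool definition.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: the model must wrap the entire source file in JSON arguments.
tools = [
    {
        "type": "function",
        "function": {
            "name": "write_file",  # illustrative name, not aider's real tool
            "description": "Return the complete, updated source file.",
            "strict": True,  # OpenAI structured outputs: enforce the schema exactly
            "parameters": {
                "type": "object",
                "properties": {
                    "explanation": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["explanation", "content"],
                "additionalProperties": False,
            },
        },
    }
]

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": 'Write hello.py so it prints "Hello".'}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "write_file"}},
)

# The code arrives as a JSON string inside the tool call arguments.
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
print(args["content"])
```

The JSON and JSON (strict) benchmark strategies described below differ only in whether this strict enforcement is enabled; the markdown strategy skips the tool call entirely.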
@@ -108,29 +108,23 @@ capable models.
 OpenAI's newly announced support for "strict" JSON seemed like a good reason to
 investigate whether the newest models are still handicapped by JSON-wrapping code.
 
-Four of the strongest code editing models were benchmarked
-to assess the impact of JSON-wrapping code:
-
-- claude-3-5-sonnet-20240620
-- deepseek-coder (V2 0724)
-- gpt-4o-2024-05-13
-- gpt-4o-2024-08-06
-
-Each model was given one try to solve
-[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
-This is the standard aider "code editing" benchmark, but restricted to a single attempt
-without a second try to "fix" any errors.
-
-The benchmark assessed the models coding ability
-using different strategies for returning code:
+The results presented here were based on
+the
+[aider "code editing" benchmark](/2023/07/02/benchmarks.html#the-benchmark)
+of 133 practice exercises from the Exercism python repository.
+Models were
+restricted to a single attempt,
+without a second try to fix errors as is normal in the aider benchmark.
+
+The performance of each model was compared across different strategies for returning code:
 
 - **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences.
-- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON.
+- **JSON** -- the model used a tool function call to return the whole source code file. This required the LLM to wrap the code in JSON.
 - **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting.
 
 The markdown strategy is the same as
 aider's "whole" edit format, where the
-LLM would return a source file like this:
+LLM returns a source file like this:
 
 ````
 Here is the program you asked for which prints "Hello":
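As a concrete complement to the markdown strategy just described, here is a minimal sketch of how a client could pull the returned file back out of a fenced reply like the one above. This is an illustrative assumption, not aider's actual "whole" format parser, which presumably handles filenames, multiple files, and other edge cases.

````python
import re


def extract_fenced_code(reply: str) -> str:
    """Return the body of the first triple-backtick fenced block in an LLM reply."""
    match = re.search(r"```[^\n]*\n(.*?)```", reply, re.DOTALL)
    if match is None:
        raise ValueError("no fenced code block found in the reply")
    return match.group(1)


reply = '''Here is the program you asked for which prints "Hello":

hello.py
```python
print("Hello")
```
'''
print(extract_fenced_code(reply))  # -> print("Hello")
````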
@@ -163,13 +157,20 @@ than correctly formulating
 instructions to edit
 portions of a file.
 
-This experimental setup is designed to highlight
-the effects of JSON-wrapping on the LLMs ability to write code to solve a task.
+This experimental setup is designed to quantify
+the effects of JSON-wrapping on the LLM's ability to write code to solve a task.
 
 ## Results
 
-Each of the 4 models was benchmarked 5 times using the different
-strategies for returning code.
+Four of the strongest code editing models were benchmarked
+to assess the impact of JSON-wrapping code:
+
+- claude-3-5-sonnet-20240620
+- deepseek-coder (V2 0724)
+- gpt-4o-2024-05-13
+- gpt-4o-2024-08-06
+
+Each combination of model and code wrapping strategy was benchmarked 5 times.
 
 ## Overall coding skill
@@ -191,22 +192,24 @@ Of course, both JSON results were well below the markdown result.
 
 ## Syntax errors
 
-Figure 2 shows the number of syntactic errors found in the code produced by each
-model and code wrapping strategy.
-Models tend to make more syntactic errors when asked to wrap code in JSON.
+Models tend to make more syntax errors when asked to wrap code in JSON.
+Figure 2 shows the number of syntax errors found in the code produced by each
+model and code wrapping strategy,
+totaling up `SyntaxError` and `IndentationError` errors from all 5 runs.
 
-Sonnet avoided syntactic errors regardless of the code wrapping strategy,
-but its benchmark scores in Figure 1 were lower with JSON.
-This seems to indicate that JSON-wrapping
-does more than simply raise the syntactic difficulty in coding.
-It may distract or challenge the model in a way that
-reduces its ability to reason about coding problems.
+Sonnet's results seem to indicate that the negative effects of JSON-wrapping
+go beyond syntactic difficulties.
+Sonnet avoided syntax errors regardless of the code wrapping strategy,
+but its benchmark scores in Figure 1 were nonetheless lower with JSON.
+This implies that JSON-wrapping may distract or challenge models in a way that
+reduces their ability to reason about solving coding problems.
 
 {% include code-in-json-syntax.js %}
 
 > Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code,
 > totaled from 5 runs.
-> Models tend to make more syntactic errors when asked to wrap code in JSON.
+> Models tend to make more syntax and formatting errors when asked to wrap code in JSON.
 
 ## Conclusions
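Since the syntax-error hunk above totals `SyntaxError` and `IndentationError` exceptions across runs, here is a minimal sketch of one way such a tally could be produced, by simply trying to parse each generated file. This is an illustrative assumption about methodology, not necessarily how the benchmark counts errors; note that `IndentationError` is a subclass of `SyntaxError`, so it must be caught first.

```python
import ast


def count_parse_errors(sources):
    """Tally generated files that fail to parse, split by error type."""
    counts = {"SyntaxError": 0, "IndentationError": 0}
    for src in sources:
        try:
            ast.parse(src)
        except IndentationError:  # subclass of SyntaxError, so catch it first
            counts["IndentationError"] += 1
        except SyntaxError:
            counts["SyntaxError"] += 1
    return counts


samples = ['print("ok")', "def f(:", "  x = 1\n y = 2"]
print(count_parse_errors(samples))
# -> {'SyntaxError': 1, 'IndentationError': 1}
```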
@@ -217,7 +220,7 @@ the conclusion seems unchanged: LLMs are bad at returning code in JSON.
 
 OpenAI appears to be making progress in allowing LLMs to return code in
 structured JSON responses without harming the code quality.
-But it seems premature to consider switching from plain text
+But it still seems premature to consider switching from plain text
 to JSON-wrapped code.
@@ -227,4 +230,4 @@ The results presented here are not directly comparable to results
 from the main
 [aider LLM leaderboard](https://aider.chat/docs/leaderboards/).
 A number of settings were changed to simplify the benchmark
-in order to focus on comparing plain text and JSON wrapped code.
+in order to focus on comparing plain text and JSON-wrapped code.