Mirror of https://github.com/Aider-AI/aider.git, synced 2025-05-25 14:55:00 +00:00
commit 3e5dba8d5c (parent 479f73871b)
2 changed files with 47 additions and 44 deletions
@@ -55,7 +55,7 @@ document.addEventListener('DOMContentLoaded', function () {
       beginAtZero: true,
       title: {
         display: true,
-        text: 'Total syntactic errors from 5 runs'
+        text: 'Total syntax errors from 5 runs'
       },
       max: 35
     }
@@ -63,7 +63,7 @@ document.addEventListener('DOMContentLoaded', function () {
     plugins: {
       title: {
         display: true,
-        text: 'Syntactic errors by model and code wrapping strategy',
+        text: 'Syntax errors by model and code wrapping strategy',
         font: {
           size: 16
         }
@@ -12,8 +12,6 @@ nav_exclude: true
 # LLMs are bad at returning code in JSON
-
-
 ## Abstract
 
 Current LLMs have support for returning properly formatted JSON,
 making it easier for clients to reliably parse complex responses.
 It therefore seems attractive for
@@ -23,8 +21,7 @@ LLMs write worse code when asked to wrap it in JSON, harming their ability
 to correctly solve coding tasks.
 On a variant of the aider code editing benchmark,
 asking for JSON-wrapped code
-often significantly harms coding
-performance.
+often harms coding performance.
 This holds true across many top coding LLMs,
 including OpenAI's latest model gpt-4o-2024-08-06
 which has strong JSON support.
@@ -36,7 +33,7 @@ which has strong JSON support.
 > Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call.
 
 
-## Introduction
+## Background
 
 A lot of people wonder why aider doesn't use LLM tools for code editing.
 Instead, aider asks for code edits in plain text, like this:
@@ -66,14 +63,17 @@ which would return a structured JSON response:
 ```
 
 This has become even more tempting as LLM providers
-continue to improve their tooling for reliably generating
-valid JSON.
-For example, OpenAI recently announced the ability to
-[strictly enforce that JSON responses will be syntactically correct
-and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/).
+continue to improve their tooling for reliably generating JSON.
+For example,
+[OpenAI recently announced](https://openai.com/index/introducing-structured-outputs-in-the-api/)
+the ability to
+strictly enforce that JSON responses will be syntactically correct
+and conform to a specified schema.
+
 
 But producing valid (schema compliant) JSON is not sufficient for working with AI generated code.
-The code inside the JSON has to be valid and high quality too.
+The code inside the JSON has to correctly solve the requested task
+and be free from syntax errors.
 Unfortunately,
 LLMs write worse code when they're asked to
 wrap it in JSON.
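The distinction this hunk draws, schema-valid JSON versus valid code inside the JSON, can be sketched in a few lines of Python; the tool-call response payload below is invented for illustration:

```python
import json

# Hypothetical tool-call response: the JSON layer is perfectly valid...
response = '{"filename": "hello.py", "code": "def greet():\\nprint(\\"Hello\\")\\n"}'
payload = json.loads(response)  # parses without complaint

# ...but the code inside has lost its indentation.
code_ok, error_name = True, None
try:
    compile(payload["code"], payload["filename"], "exec")
except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
    code_ok, error_name = False, type(err).__name__

print(error_name)  # IndentationError
```

The point mirrors the new wording in the hunk: strict schemas guarantee the wrapper, not the program inside it.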
@@ -108,29 +108,23 @@ capable models.
 OpenAI's newly announced support for "strict" JSON seemed like a good reason to
 investigate whether the newest models are still handicapped by JSON-wrapping code.
 
-Four of the strongest code editing models were benchmarked
-to assess the impact of JSON-wrapping code:
+The results presented here were based on
+the
+[aider "code editing" benchmark](/2023/07/02/benchmarks.html#the-benchmark)
+of 133 practice exercises from the Exercism python repository.
+Models were
+restricted to a single attempt,
+without a second try to fix errors as is normal in the aider benchmark.
 
-- claude-3-5-sonnet-20240620
-- deepseek-coder (V2 0724)
-- gpt-4o-2024-05-13
-- gpt-4o-2024-08-06
-
-Each model was given one try to solve
-[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
-This is the standard aider "code editing" benchmark, but restricted to a single attempt
-without a second try to "fix" any errors.
-
-The benchmark assessed the models coding ability
-using different strategies for returning code:
+The performance of each model was compared across different strategies for returning code:
 
 - **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences.
-- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON.
+- **JSON** -- the model used a tool function call to return the whole source code file. This required the LLM to wrap the code in JSON.
 - **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting.
 
 The markdown strategy is the same as
 aider's "whole" edit format, where the
-LLM would return a source file like this:
+LLM returns a source file like this:
 
 ````
 Here is the program you asked for which prints "Hello":
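As a rough sketch of how a markdown "whole file" reply like the one in this hunk can be consumed, the following Python pulls the first fenced block out of a hypothetical reply. The reply text and the regex are illustrative only, not aider's actual parser:

```python
import re

FENCE = "`" * 3  # a triple-backtick fence, built up to avoid nesting issues here

# A hypothetical reply in the markdown "whole file" style described above.
reply = (
    'Here is the program you asked for which prints "Hello":\n'
    "\n"
    "hello.py\n"
    f"{FENCE}\n"
    'print("Hello")\n'
    f"{FENCE}\n"
)

# Capture everything between the first pair of fences.
match = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, reply, re.DOTALL)
code = match.group(1)
```

Because the code travels as plain text, no JSON string escaping of quotes or newlines is involved, which is the markdown strategy's advantage.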
@@ -163,13 +157,20 @@ than correctly formulating
 instructions to edit
 portions of a file.
 
-This experimental setup is designed to highlight
+This experimental setup is designed to quantify
 the effects of JSON-wrapping on the LLMs ability to write code to solve a task.
 
 ## Results
 
-Each of the 4 models was benchmarked 5 times using the different
-strategies for returning code.
+Four of the strongest code editing models were benchmarked
+to assess the impact of JSON-wrapping code:
+
+- claude-3-5-sonnet-20240620
+- deepseek-coder (V2 0724)
+- gpt-4o-2024-05-13
+- gpt-4o-2024-08-06
+
+Each combination of model and code wrapping strategy was benchmarked 5 times.
 
 ## Overall coding skill
@@ -191,22 +192,24 @@ Of course, both JSON results were well below the markdown result.
 
 ## Syntax errors
 
-Figure 2 shows the number of syntactic errors found in the code produced by each
-model and code wrapping strategy.
-Models tend to make more syntactic errors when asked to wrap code in JSON.
+Models tend to make more syntax errors when asked to wrap code in JSON.
+Figure 2 shows the number of syntax errors found in the code produced by each
+model and code wrapping strategy,
+totaling up `SyntaxError` and `IndentationError` errors from all 5 runs.
 
-Sonnet avoided syntactic errors regardless of the code wrapping strategy,
-but its benchmark scores in Figure 1 were lower with JSON.
-This seems to indicate that JSON-wrapping
-does more than simply raise the syntactic difficulty in coding.
-It may distract or challenge the model in a way that
-reduces its ability to reason about coding problems.
+Sonnet's results seems to indicate that the negative effects of JSON-wrapping
+go beyond syntactic difficulties.
+Sonnet avoided syntax errors regardless of the code wrapping strategy,
+but its benchmark scores in Figure 1 were nonetheless lower with JSON.
+This implies that JSON-wrapping may distract or challenge models in a way that
+reduces their ability to reason about solving coding problems.
 
 {% include code-in-json-syntax.js %}
 
 > Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code,
 > totaled from 5 runs.
-> Models tend to make more syntactic errors when asked to wrap code in JSON.
+> Models tend to make more syntax and formatting errors when asked to wrap code in JSON.
 
 
 ## Conclusions
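The error counting this hunk describes, totaling `SyntaxError` and `IndentationError` across runs, can be approximated with Python's built-in `compile`. The samples below are invented stand-ins for model output, not the benchmark's real data:

```python
# Hypothetical code samples standing in for model output from several runs.
samples = [
    'print("ok")\n',            # clean
    'def f():\nreturn 1\n',     # body not indented -> IndentationError
    'print("unterminated\n',    # unterminated string -> SyntaxError
]

def count_syntax_errors(sources):
    """Total SyntaxError/IndentationError occurrences, Figure 2 style."""
    errors = 0
    for src in sources:
        try:
            compile(src, "<sample>", "exec")
        except SyntaxError:  # IndentationError is a subclass, so both are caught
            errors += 1
    return errors

print(count_syntax_errors(samples))  # 2
```

Catching `SyntaxError` alone suffices because `IndentationError` derives from it in Python's exception hierarchy.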
@@ -217,7 +220,7 @@ the conclusion seems unchanged: LLMs are bad at returning code in JSON.
 
 OpenAI appears to be making progress in allowing LLMs to return code in
 structured JSON responses without harming the code quality.
-But it seems premature to consider switching from plain text
+But it still seems premature to consider switching from plain text
 to JSON-wrapped code.
 
 
@@ -227,4 +230,4 @@ The results presented here are not directly comparable to results
 from the main
 [aider LLM leaderboard](https://aider.chat/docs/leaderboards/).
 A number of settings were changed to simplify the benchmark
-in order to focus on comparing plain text and JSON wrapped code.
+in order to focus on comparing plain text and JSON-wrapped code.