mirror of
https://github.com/Aider-AI/aider.git
synced 2025-05-25 23:05:00 +00:00
parent
479f73871b
commit
3e5dba8d5c
2 changed files with 47 additions and 44 deletions
|
@ -55,7 +55,7 @@ document.addEventListener('DOMContentLoaded', function () {
|
||||||
beginAtZero: true,
|
beginAtZero: true,
|
||||||
title: {
|
title: {
|
||||||
display: true,
|
display: true,
|
||||||
text: 'Total syntactic errors from 5 runs'
|
text: 'Total syntax errors from 5 runs'
|
||||||
},
|
},
|
||||||
max: 35
|
max: 35
|
||||||
}
|
}
|
||||||
|
@ -63,7 +63,7 @@ document.addEventListener('DOMContentLoaded', function () {
|
||||||
plugins: {
|
plugins: {
|
||||||
title: {
|
title: {
|
||||||
display: true,
|
display: true,
|
||||||
text: 'Syntactic errors by model and code wrapping strategy',
|
text: 'Syntax errors by model and code wrapping strategy',
|
||||||
font: {
|
font: {
|
||||||
size: 16
|
size: 16
|
||||||
}
|
}
|
||||||
|
|
|
@ -12,8 +12,6 @@ nav_exclude: true
|
||||||
# LLMs are bad at returning code in JSON
|
# LLMs are bad at returning code in JSON
|
||||||
|
|
||||||
|
|
||||||
## Abstract
|
|
||||||
|
|
||||||
Current LLMs have support for returning properly formatted JSON,
|
Current LLMs have support for returning properly formatted JSON,
|
||||||
making it easier for clients to reliably parse complex responses.
|
making it easier for clients to reliably parse complex responses.
|
||||||
It therefore seems attractive for
|
It therefore seems attractive for
|
||||||
|
@ -23,8 +21,7 @@ LLMs write worse code when asked to wrap it in JSON, harming their ability
|
||||||
to correctly solve coding tasks.
|
to correctly solve coding tasks.
|
||||||
On a variant of the aider code editing benchmark,
|
On a variant of the aider code editing benchmark,
|
||||||
asking for JSON-wrapped code
|
asking for JSON-wrapped code
|
||||||
often significantly harms coding
|
often harms coding performance.
|
||||||
performance.
|
|
||||||
This holds true across many top coding LLMs,
|
This holds true across many top coding LLMs,
|
||||||
including OpenAI's latest model gpt-4o-2024-08-06
|
including OpenAI's latest model gpt-4o-2024-08-06
|
||||||
which has strong JSON support.
|
which has strong JSON support.
|
||||||
|
@ -36,7 +33,7 @@ which has strong JSON support.
|
||||||
> Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call.
|
> Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call.
|
||||||
|
|
||||||
|
|
||||||
## Introduction
|
## Background
|
||||||
|
|
||||||
A lot of people wonder why aider doesn't use LLM tools for code editing.
|
A lot of people wonder why aider doesn't use LLM tools for code editing.
|
||||||
Instead, aider asks for code edits in plain text, like this:
|
Instead, aider asks for code edits in plain text, like this:
|
||||||
|
@ -66,14 +63,17 @@ which would return a structured JSON response:
|
||||||
```
|
```
|
||||||
|
|
||||||
This has become even more tempting as LLM providers
|
This has become even more tempting as LLM providers
|
||||||
continue to improve their tooling for reliably generating
|
continue to improve their tooling for reliably generating JSON.
|
||||||
valid JSON.
|
For example,
|
||||||
For example, OpenAI recently announced the ability to
|
[OpenAI recently announced](https://openai.com/index/introducing-structured-outputs-in-the-api/)
|
||||||
[strictly enforce that JSON responses will be syntactically correct
|
the ability to
|
||||||
and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/).
|
strictly enforce that JSON responses will be syntactically correct
|
||||||
|
and conform to a specified schema.
|
||||||
|
|
||||||
|
|
||||||
But producing valid (schema compliant) JSON is not sufficient for working with AI generated code.
|
But producing valid (schema compliant) JSON is not sufficient for working with AI generated code.
|
||||||
The code inside the JSON has to be valid and high quality too.
|
The code inside the JSON has to correctly solve the requested task
|
||||||
|
and be free from syntax errors.
|
||||||
Unfortunately,
|
Unfortunately,
|
||||||
LLMs write worse code when they're asked to
|
LLMs write worse code when they're asked to
|
||||||
wrap it in JSON.
|
wrap it in JSON.
|
||||||
|
@ -108,29 +108,23 @@ capable models.
|
||||||
OpenAI's newly announced support for "strict" JSON seemed like a good reason to
|
OpenAI's newly announced support for "strict" JSON seemed like a good reason to
|
||||||
investigate whether the newest models are still handicapped by JSON-wrapping code.
|
investigate whether the newest models are still handicapped by JSON-wrapping code.
|
||||||
|
|
||||||
Four of the strongest code editing models were benchmarked
|
The results presented here were based on
|
||||||
to assess the impact of JSON-wrapping code:
|
the
|
||||||
|
[aider "code editing" benchmark](/2023/07/02/benchmarks.html#the-benchmark)
|
||||||
|
of 133 practice exercises from the Exercism python repository.
|
||||||
|
Models were
|
||||||
|
restricted to a single attempt,
|
||||||
|
without a second try to fix errors as is normal in the aider benchmark.
|
||||||
|
|
||||||
- claude-3-5-sonnet-20240620
|
The performance of each model was compared across different strategies for returning code:
|
||||||
- deepseek-coder (V2 0724)
|
|
||||||
- gpt-4o-2024-05-13
|
|
||||||
- gpt-4o-2024-08-06
|
|
||||||
|
|
||||||
Each model was given one try to solve
|
|
||||||
[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
|
|
||||||
This is the standard aider "code editing" benchmark, but restricted to a single attempt
|
|
||||||
without a second try to "fix" any errors.
|
|
||||||
|
|
||||||
The benchmark assessed the models coding ability
|
|
||||||
using different strategies for returning code:
|
|
||||||
|
|
||||||
- **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences.
|
- **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences.
|
||||||
- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON.
|
- **JSON** -- the model used a tool function call to return the whole source code file. This required the LLM to wrap the code in JSON.
|
||||||
- **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting.
|
- **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting.
|
||||||
|
|
||||||
The markdown strategy is the same as
|
The markdown strategy is the same as
|
||||||
aider's "whole" edit format, where the
|
aider's "whole" edit format, where the
|
||||||
LLM would return a source file like this:
|
LLM returns a source file like this:
|
||||||
|
|
||||||
````
|
````
|
||||||
Here is the program you asked for which prints "Hello":
|
Here is the program you asked for which prints "Hello":
|
||||||
|
@ -163,13 +157,20 @@ than correctly formulating
|
||||||
instructions to edit
|
instructions to edit
|
||||||
portions of a file.
|
portions of a file.
|
||||||
|
|
||||||
This experimental setup is designed to highlight
|
This experimental setup is designed to quantify
|
||||||
the effects of JSON-wrapping on the LLMs ability to write code to solve a task.
|
the effects of JSON-wrapping on the LLMs ability to write code to solve a task.
|
||||||
|
|
||||||
## Results
|
## Results
|
||||||
|
|
||||||
Each of the 4 models was benchmarked 5 times using the different
|
Four of the strongest code editing models were benchmarked
|
||||||
strategies for returning code.
|
to assess the impact of JSON-wrapping code:
|
||||||
|
|
||||||
|
- claude-3-5-sonnet-20240620
|
||||||
|
- deepseek-coder (V2 0724)
|
||||||
|
- gpt-4o-2024-05-13
|
||||||
|
- gpt-4o-2024-08-06
|
||||||
|
|
||||||
|
Each combination of model and code wrapping strategy was benchmarked 5 times.
|
||||||
|
|
||||||
## Overall coding skill
|
## Overall coding skill
|
||||||
|
|
||||||
|
@ -191,22 +192,24 @@ Of course, both JSON results were well below the markdown result.
|
||||||
|
|
||||||
## Syntax errors
|
## Syntax errors
|
||||||
|
|
||||||
Figure 2 shows the number of syntactic errors found in the code produced by each
|
Models tend to make more syntax errors when asked to wrap code in JSON.
|
||||||
model and code wrapping strategy.
|
Figure 2 shows the number of syntax errors found in the code produced by each
|
||||||
Models tend to make more syntactic errors when asked to wrap code in JSON.
|
model and code wrapping strategy,
|
||||||
|
totaling up `SyntaxError` and `IndentationError` errors from all 5 runs.
|
||||||
|
|
||||||
Sonnet avoided syntactic errors regardless of the code wrapping strategy,
|
|
||||||
but its benchmark scores in Figure 1 were lower with JSON.
|
Sonnet's results seems to indicate that the negative effects of JSON-wrapping
|
||||||
This seems to indicate that JSON-wrapping
|
go beyond syntactic difficulties.
|
||||||
does more than simply raise the syntactic difficulty in coding.
|
Sonnet avoided syntax errors regardless of the code wrapping strategy,
|
||||||
It may distract or challenge the model in a way that
|
but its benchmark scores in Figure 1 were nonetheless lower with JSON.
|
||||||
reduces its ability to reason about coding problems.
|
This implies that JSON-wrapping may distract or challenge models in a way that
|
||||||
|
reduces their ability to reason about solving coding problems.
|
||||||
|
|
||||||
{% include code-in-json-syntax.js %}
|
{% include code-in-json-syntax.js %}
|
||||||
|
|
||||||
> Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code,
|
> Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code,
|
||||||
> totaled from 5 runs.
|
> totaled from 5 runs.
|
||||||
> Models tend to make more syntactic errors when asked to wrap code in JSON.
|
> Models tend to make more syntax and formatting errors when asked to wrap code in JSON.
|
||||||
|
|
||||||
|
|
||||||
## Conclusions
|
## Conclusions
|
||||||
|
@ -217,7 +220,7 @@ the conclusion seems unchanged: LLMs are bad at returning code in JSON.
|
||||||
|
|
||||||
OpenAI appears to be making progress in allowing LLMs to return code in
|
OpenAI appears to be making progress in allowing LLMs to return code in
|
||||||
structured JSON responses without harming the code quality.
|
structured JSON responses without harming the code quality.
|
||||||
But it seems premature to consider switching from plain text
|
But it still seems premature to consider switching from plain text
|
||||||
to JSON-wrapped code.
|
to JSON-wrapped code.
|
||||||
|
|
||||||
|
|
||||||
|
@ -227,4 +230,4 @@ The results presented here are not directly comparable to results
|
||||||
from the main
|
from the main
|
||||||
[aider LLM leaderboard](https://aider.chat/docs/leaderboards/).
|
[aider LLM leaderboard](https://aider.chat/docs/leaderboards/).
|
||||||
A number of settings were changed to simplify the benchmark
|
A number of settings were changed to simplify the benchmark
|
||||||
in order to focus on comparing plain text and JSON wrapped code.
|
in order to focus on comparing plain text and JSON-wrapped code.
|
||||||
|
|