---
title: LLMs are bad at returning code in json
excerpt: LLMs write worse code if you ask them to return the code wrapped in json via a tool/function call.
highlight_image: /assets/code-in-json.jpg
draft: true
nav_exclude: true
---
{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}
# LLMs are bad at returning code in json
<canvas id="passRateChart" width="800" height="400" style="margin-bottom: 20px"></canvas>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script>
document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('passRateChart').getContext('2d');
  var yamlData = {{ site.data.code-in-json | jsonify }};

  var models = [...new Set(yamlData.map(item => item.model))];
  var editFormats = [...new Set(yamlData.map(item => item.edit_format))];

  var datasets = editFormats.map(format => ({
    label: format,
    data: models.map(model => {
      var item = yamlData.find(d => d.model === model && d.edit_format === format);
      return item ? item.pass_rate_1 : null;
    }),
    backgroundColor: format === 'Markdown' ? 'rgba(54, 162, 235, 0.8)' :
                     format === 'Tool call' ? 'rgba(255, 99, 132, 0.8)' :
                     'rgba(75, 192, 192, 0.8)',
  }));

  var data = {
    labels: models,
    datasets: datasets
  };

  var config = {
    type: 'bar',
    data: data,
    options: {
      responsive: true,
      scales: {
        x: {
          title: {
            display: true,
            text: 'Model'
          }
        },
        y: {
          beginAtZero: true,
          title: {
            display: true,
            text: 'Pass Rate (%)'
          },
          max: 70
        }
      },
      plugins: {
        title: {
          display: true,
          text: 'Pass rate by model and code return strategy',
          font: {
            size: 16
          }
        },
        legend: {
          position: 'top',
        }
      }
    }
  };

  new Chart(ctx, config);
});
</script>
A lot of people wonder why aider doesn't have LLMs use tools or function calls to
specify code edits.
Instead, aider asks LLMs to return code edits in plain text, like this:
````
greeting.py
```python
<<<<<<< SEARCH
def greeting():
    print("Hello")
=======
def greeting():
    print("Goodbye")
>>>>>>> REPLACE
```
````
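
Parsing a reply in this format is simple text processing.
Here is a minimal sketch of how one of these edit blocks could be applied,
assuming exactly the markers shown above
(illustrative, not aider's actual implementation):

```python
import re

def apply_search_replace(reply: str, source: str) -> str:
    """Apply the first SEARCH/REPLACE block in an LLM reply to the source text."""
    match = re.search(
        r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
        reply,
        re.DOTALL,
    )
    if not match:
        raise ValueError("no SEARCH/REPLACE block found")
    search_text, replace_text = match.group(1), match.group(2)
    if search_text not in source:
        raise ValueError("SEARCH text not found in the source file")
    # Swap in the replacement at the first occurrence of the search text.
    return source.replace(search_text, replace_text, 1)
```
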
People expect that it would be easier and more reliable
for aider to parse a nicely formatted json
response, more like this:
```
{
    "filename": "greeting.py",
    "start_line": 6,
    "end_line": 7,
    "new_content": "def greeting():\n    print(\"Goodbye\")\n"
}
```
This seems even more tempting as LLMs get better tooling for reliably generating
valid json, or even enforcing that it meets a specific schema.
For example, OpenAI recently announced
[strict enforcement of json responses](https://openai.com/index/introducing-structured-outputs-in-the-api/).
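
To make that concrete, here is a sketch of requesting
the edit object shown above with its schema strictly enforced,
using the OpenAI python client.
This is only an illustration of the API,
not how aider or the benchmark below calls it:

```python
from openai import OpenAI

client = OpenAI()

# Sketch: request the edit object above, with its schema strictly enforced.
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Change greeting.py to print Goodbye."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "code_edit",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "filename": {"type": "string"},
                    "start_line": {"type": "integer"},
                    "end_line": {"type": "integer"},
                    "new_content": {"type": "string"},
                },
                "required": ["filename", "start_line", "end_line", "new_content"],
                "additionalProperties": False,
            },
        },
    },
)

# The reply is guaranteed to be valid json that matches the schema.
print(response.choices[0].message.content)
```
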
But it's not sufficient to just produce
valid json; it also
has to contain quality code.
Unfortunately,
LLMs write worse code when they're asked to
emit it wrapped in json.
In some sense this shouldn't be surprising.
Just look at the very simple
json example above, with the escaped
quotes `\"` and
newlines `\n`
mixed into the code.
Coding is complicated enough without having to escape all the special characters too.
If I asked you to write me a program, would you do a better job
typing it into a text file or hand typing it as a properly escaped json string?
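
You can see the tax concretely by round-tripping
a tiny program through python's standard `json` module:

```python
import json

code = '''def greeting():
    print("Hello")
'''

# The same two lines of python, as they must be delivered inside a json payload.
print(json.dumps({"content": code}))
# {"content": "def greeting():\n    print(\"Hello\")\n"}
```
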
## Quantifying the benefits of plain text
Previous [benchmark results](/2023/07/02/benchmarks.html)
showed
the superiority of returning code
as plain text compared to json-wrapped function calls.
But those results were obtained
over a year ago, against far less
capable models.
OpenAI's newly announced support for "strict" json seemed like a good reason to
investigate whether the newest models are still handicapped by json-wrapping code.
The graph above shows benchmark
results from
3 of the strongest code editing models:
- gpt-4o-2024-08-06
- claude-3-5-sonnet-20240620
- deepseek-coder (V2 0724)
Each model was given one try to solve
[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
This is the standard aider "code editing" benchmark, except restricted to a single attempt.
Each model was assessed by the benchmark with two
different strategies for returning code:
- **Markdown** -- where the model simply returned the whole source code file in standard markdown triple-backtick fences.
- **Tool call** -- where the model was told to use a function to return the whole source code file. This requires the LLM to wrap the code in json.
The markdown strategy is the same as
aider's "whole" edit format.
It asks the LLM to return a program like this:
````
Here is the program you asked for which prints "Hello":
greeting.py
```
def greeting():
    print("Hello")
```
````
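
Recovering the file from a reply like this only requires
pulling out the contents of the fenced block.
A minimal sketch (again, not aider's actual code):

````python
import re

def extract_fenced_code(reply: str) -> str:
    """Return the body of the first triple-backtick fenced block in an LLM reply."""
    match = re.search(r"```[^\n]*\n(.*?)```", reply, re.DOTALL)
    if not match:
        raise ValueError("no fenced code block found")
    return match.group(1)
````
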
The tool strategy requires the LLM to call the `write_file` function with
two parameters, like this:
```
{
    "explanation": "Here is the program you asked for which prints \"Hello\"",
    "content": "def greeting():\n    print(\"Hello\")\n"
}
```
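
For reference, here is a sketch of what such a `write_file` tool definition
can look like in json schema form.
The description string and the `strict` flag are illustrative assumptions,
not the benchmark's exact schema:

```python
# Sketch of a `write_file` tool with the two parameters described above.
# The description and "strict" flag are assumptions, not the benchmark's exact schema.
write_file_tool = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Return the full, updated source code file.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["explanation", "content"],
            "additionalProperties": False,
        },
    },
}
```
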
Both of these formats avoid actually *editing* source files, to keep
the task as
simple as possible.
The LLM can emit the whole source file intact,
which is much easier
than correctly formulating
instructions to edit
portions of a file.
We are simply testing the effects of json-wrapping on the LLMs' ability to write code to solve a task.
## Results
All 3 models did significantly worse on the benchmark when asked to
return json-wrapped code in a tool function call.