…valid json, or even enforcing that it meets a specific schema.
For example, OpenAI recently announced
[strict enforcement of json responses]().
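
For instance, a schema-constrained request through the OpenAI python client might look something like this (a minimal sketch; the schema shown is invented for illustration and isn't part of aider's benchmark):

```python
from openai import OpenAI

client = OpenAI()

# Ask for a response that must conform to a json schema ("strict" mode).
completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Write a program which prints Hello"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "source_file",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "explanation": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["explanation", "content"],
                "additionalProperties": False,
            },
        },
    },
)
```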

But it's not sufficient to just produce
valid json; it also
has to contain quality code.
Unfortunately,
LLMs write worse code when they're asked to
emit it wrapped in json.

In some sense this shouldn't be surprising.
Just look at the very simple
json example above, with the escaped…
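
To see the escaping burden concretely, here's a small sketch of what even a two-line program turns into once it's packed into a json string:

```python
import json

# A trivial two-line program, as plain text.
code = 'def greeting():\n    print("Hello")\n'

# The same program wrapped in a json container.
print(json.dumps({"content": code}))
# {"content": "def greeting():\n    print(\"Hello\")\n"}
```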

…typing it into a text file or hand typing it as a properly escaped json string?

Previous [benchmark results](/2023/07/02/benchmarks.html)
showed the superiority of returning code
as plain text compared to json-wrapped function calls.
But those results were obtained over a year ago,
against far less capable models.
OpenAI's newly announced support for "strict" json seemed like a good reason to
investigate whether the newest models are still handicapped by json-wrapping code.

The graph above shows benchmark results from
3 of the strongest code editing models:

- gpt-4o-2024-08-06
- claude-3-5-sonnet-20240620

…

Each model was given one try to solve
[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
This is the standard aider "code editing" benchmark, except restricted to a single attempt.

Each model was assessed by the benchmark with two
different strategies for returning code:

- **Markdown** -- where the model simply returns the whole source code file in standard markdown triple-backtick fences.
- **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json.

The markdown strategy is the same as
aider's "whole" edit format.
It asks the LLM to return a program like this:

````
Here is the program you asked for which prints "Hello":

greeting.py

```
def greeting():
    print("Hello")
```
````
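
Recovering the file from a reply like that just means finding the fenced block. A minimal sketch of such a parser (not aider's actual implementation) might be:

```python
import re

def extract_fenced_code(reply: str) -> str:
    """Return the body of the first triple-backtick fence in an LLM reply."""
    match = re.search(r"```[^\n]*\n(.*?)```", reply, re.DOTALL)
    if match is None:
        raise ValueError("no fenced code block found")
    return match.group(1)
```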

…two parameters, like this:

```
{
    "explanation": "Here is the program you asked for which prints \"Hello\"",
    "content": "def greeting():\n    print(\"Hello\")\n"
}
```
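
For comparison, the function the model is told to call could be declared along these lines (a sketch; the function name `write_file` and its description are assumptions, not necessarily what the benchmark used):

```python
# Hypothetical tool definition with the two parameters shown above.
# The name "write_file" is illustrative, not taken from aider's source.
tools = [
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Return an explanation plus the whole source file.",
            "parameters": {
                "type": "object",
                "properties": {
                    "explanation": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["explanation", "content"],
            },
        },
    }
]
```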

Both of these formats avoid actually *editing* source files,
to keep the task as simple as possible.
The LLM can emit the whole source file intact,
which is much easier than correctly formulating
instructions to edit portions of a file.

We are simply testing the effects of json-wrapping on the LLM's ability to write code to solve a task.

## Results