mirror of https://github.com/Aider-AI/aider.git (synced 2025-05-31 17:55:01 +00:00)
Update 2024-08-14-code-in-json.md

commit a951a2afc9 (parent b3ed2c8a48)

1 changed file with 29 additions and 19 deletions
@@ -116,14 +116,13 @@ valid json, or even enforcing that it meets a specific schema.
 For example, OpenAI recently announced
 [strict enforcement of json responses]().
 
-The problem is that LLMs are bad a writing code when you ask them to wrap it
-into a json container.
-The json tooling around the LLM helps make sure it's valid json,
-which does solve an important problem.
-LLMs used to frequently produce invalid json, so that's a big step forward.
-
-The problem remains, LLMs write worse code when they're asked to
+But it's not sufficient to just produce
+valid json, it also
+has to contain quality code.
+Unfortunately,
+LLMs write worse code when they're asked to
+emit it wrapped in json.
 
 In some sense this shouldn't be surprising.
 Just look at the very simple
 json example above, with the escaped
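The strict json enforcement referenced in this hunk is OpenAI's Structured Outputs feature, which pins the reply to a json schema. As a concrete illustration, here is a minimal sketch of requesting it through the chat completions API; the `code_reply` schema and its `explanation`/`content` fields are chosen to mirror the example later in this diff, not taken from the benchmark itself:

```python
# Minimal sketch: asking OpenAI's chat completions API to strictly enforce
# a json schema on the reply. The schema is illustrative; strict mode
# requires every property to be listed in "required" and
# additionalProperties set to false.
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "explanation": {"type": "string"},
        "content": {"type": "string"},  # the whole source file, json-escaped
    },
    "required": ["explanation", "content"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": 'Write greeting.py; it should print "Hello".'}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "code_reply", "strict": True, "schema": schema},
    },
)
print(response.choices[0].message.content)  # guaranteed to parse as json
```

Strict mode guarantees the reply parses and matches the schema; as the hunk above argues, it says nothing about the quality of the code inside the `content` string.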
@@ -140,12 +139,17 @@ typing it into a text file or hand typing it as a properly escaped json string?
 
 Previous [benchmark results](/2023/07/02/benchmarks.html)
 showed
-the superiority of plain text coding compared to json-wrapped function calls,
-but they were done over a year ago.
+the superiority of returning code
+as plain text coding compared to json-wrapped function calls.
+But those results were obtained
+over a year ago, against far less
+capable models.
 OpenAI's newly announced support for "strict" json seemed like a good reason to
 investigate whether the newest models are still handicapped by json-wrapping code.
 
-To find out, I benchmarked 3 of the strongest code editing models:
+The graph above shows benchmark
+results from
+3 of the strongest code editing models:
 
 - gpt-4o-2024-08-06
 - claude-3-5-sonnet-20240620
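Since each model gets exactly one try per exercise, the benchmark reduces to a single-attempt loop scored by the exercises' own unit tests. A hypothetical sketch of that loop is below; `solve_once` is a placeholder standing in for the real harness (which this diff does not show), not aider's actual benchmark code:

```python
# Hypothetical sketch of a one-attempt benchmark loop.
import subprocess
from pathlib import Path

def solve_once(model: str, exercise: Path, strategy: str) -> str:
    """Placeholder: prompt `model` once about `exercise` using `strategy`
    ("markdown" or "tool call") and return the source file it produced."""
    raise NotImplementedError

def run_benchmark(model: str, exercises: list[Path], strategy: str) -> float:
    passed = 0
    for exercise in exercises:
        source = solve_once(model, exercise, strategy)  # exactly one try
        (exercise / "solution.py").write_text(source)
        # Each exercism exercise ships unit tests that decide pass/fail.
        if subprocess.run(["pytest", str(exercise)]).returncode == 0:
            passed += 1
    return 100 * passed / len(exercises)  # percent of exercises solved
```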
@@ -155,15 +159,18 @@ Each model was given one try to solve
 [133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
 This is the standard aider "code editing" benchmark, except restricted to a single attempt.
 
-Each model ran through the benchmark with two strategies for returning code:
+Each model was assessed by the benchmark with two
+different strategies for returning code:
 
-- **Markdown** -- where the model simply returns the whole source code file in standard markdown triple-backtick fences.
+- **Markdown** -- where the model simply returned the whole source code file in standard markdown triple-backtick fences.
 - **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json.
 
-The markdown strategy would return a program like this:
+The markdown strategy is the same as
+aider's "whole" edit format.
+It asks the LLM to return a program like this:
 
 ````
-Here is the program you asked for which prints "Hello, world!":
+Here is the program you asked for which prints "Hello":
 
 greeting.py
 ```
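To score a markdown reply, the harness has to recover the filename and the fenced file body from the text. The following is a rough sketch of that extraction under the filename-then-fence layout shown above; it is an illustration, not aider's actual "whole" format parser:

````python
import re

# Rough sketch: pull "filename\n```...```" pairs out of a markdown reply.
# Illustrative only -- aider's real "whole" format parsing is more robust.
FENCED_FILE = re.compile(r"^(\S+\.\w+)\n```\w*\n(.*?)\n```", re.MULTILINE | re.DOTALL)

def extract_files(reply: str) -> dict[str, str]:
    return {name: body + "\n" for name, body in FENCED_FILE.findall(reply)}

reply = '''Here is the program you asked for which prints "Hello":

greeting.py
```
def greeting():
    print("Hello")
```
'''
print(extract_files(reply))
# {'greeting.py': 'def greeting():\n    print("Hello")\n'}
````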
@@ -177,18 +184,21 @@ two parameters, like this:
 
 ```
 {
-    "explanation": "Here is the program you asked for which prints \"Hello, world!\"",
+    "explanation": "Here is the program you asked for which prints \"Hello\"",
     "content": "def greeting():\n    print(\"Hello\")\n"
 }
 ```
 
-Both of these formats avoid actually *editing* source files, to keep things as
+Both of these formats avoid actually *editing* source files, to keep
+the task as
 simple as possible.
-This makes the task much easier, since the LLM can emit the whole source file intact.
-LLMs find it much more challenging to correctly formulate instructions to edit
+The LLM can emit the whole source file intact,
+which is much easier
+than correctly formulating
+instructions to edit
 portions of a file.
 
-We are simply testing the effects of json-wrapping on the LLMs ability to solve coding tasks.
+We are simply testing the effects of json-wrapping on the LLMs ability to write code to solve a task.
 
 ## Results
 
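For comparison with the markdown example, the tool call strategy amounts to declaring a function with those two string parameters and letting the API's json machinery wrap the code. A minimal sketch, assuming an OpenAI-style tool definition (the name `write_file` is illustrative, not necessarily what the benchmark used):

```python
import json

# Sketch of an OpenAI-style tool with the two parameters shown above.
# The function name "write_file" is illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Return an explanation and the whole source file.",
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["explanation", "content"],
        },
    },
}]

# The wrapping itself is the cost: every newline and quote in the code
# must be escaped inside the json string the model emits.
code = 'def greeting():\n    print("Hello")\n'
print(json.dumps({"content": code}))
# {"content": "def greeting():\n    print(\"Hello\")\n"}
```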