Update 2024-08-14-code-in-json.md

commit a951a2afc9
parent b3ed2c8a48
Author: paul-gauthier
Date: 2024-08-14 18:56:01 -07:00 (committed by GitHub)

@@ -116,14 +116,13 @@ valid json, or even enforcing that it meets a specific schema.
 For example, OpenAI recently announced
 [strict enforcement of json responses]().
-The problem is that LLMs are bad a writing code when you ask them to wrap it
-into a json container.
-The json tooling around the LLM helps make sure it's valid json,
-which does solve an important problem.
-LLMs used to frequently produce invalid json, so that's a big step forward.
-The problem remains, LLMs write worse code when they're asked to
+But it's not sufficient to just produce
+valid json, it also
+has to contain quality code.
+Unfortunately,
+LLMs write worse code when they're asked to
 emit it wrapped in json.
 In some sense this shouldn't be surprising.
 Just look at the very simple
 json example above, with the escaped
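
To make the escaping that sentence points to concrete, here is a minimal Python sketch (illustrative only, not taken from the post or the commit) of how a trivial program has to be encoded before it can travel inside a json string:

```python
import json

# A trivial program, exactly as it would be typed into a source file.
plain = 'def greeting():\n    print("Hello")\n'

# The same program as an LLM must emit it inside a json payload:
# every newline becomes \n and every quote becomes \".
print(json.dumps({"content": plain}))
# -> {"content": "def greeting():\n    print(\"Hello\")\n"}
```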
@@ -140,12 +139,17 @@ typing it into a text file or hand typing it as a properly escaped json string?
 Previous [benchmark results](/2023/07/02/benchmarks.html)
 showed
-the superiority of plain text coding compared to json-wrapped function calls,
-but they were done over a year ago.
+the superiority of returning code
+as plain text compared to json-wrapped function calls.
+But those results were obtained
+over a year ago, against far less
+capable models.
 
 OpenAI's newly announced support for "strict" json seemed like a good reason to
 investigate whether the newest models are still handicapped by json-wrapping code.
-To find out, I benchmarked 3 of the strongest code editing models:
+The graph above shows benchmark
+results from
+3 of the strongest code editing models:
 
 - gpt-4o-2024-08-06
 - claude-3-5-sonnet-20240620
@@ -155,15 +159,18 @@ Each model was given one try to solve
 [133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark).
 This is the standard aider "code editing" benchmark, except restricted to a single attempt.
-Each model ran through the benchmark with two strategies for returning code:
+Each model was assessed by the benchmark with two
+different strategies for returning code:
 
-- **Markdown** -- where the model simply returns the whole source code file in standard markdown triple-backtick fences.
+- **Markdown** -- where the model simply returned the whole source code file in standard markdown triple-backtick fences.
 - **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json.
 
-The markdown strategy would return a program like this:
+The markdown strategy is the same as
+aider's "whole" edit format.
+It asks the LLM to return a program like this:
 
 ````
-Here is the program you asked for which prints "Hello, world!":
+Here is the program you asked for which prints "Hello":
 
 greeting.py
 ```
@@ -177,18 +184,21 @@ two parameters, like this:
 ```
 {
-    "explanation": "Here is the program you asked for which prints \"Hello, world!\"",
+    "explanation": "Here is the program you asked for which prints \"Hello\"",
     "content": "def greeting():\n    print(\"Hello\")\n"
 }
 ```
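
For reference, the tool call strategy implies a function definition along these lines. This is only a sketch: the name `write_file` and the exact schema are assumptions rather than aider's actual definition, but it follows OpenAI's strict function-calling layout with the two parameters shown in the json example above.

```python
# Hypothetical tool definition for the "tool call" strategy.
# The name "write_file" and the description are assumptions for illustration;
# the two parameters match the json example above.
write_file_tool = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Return the complete, updated source file.",
        "strict": True,  # OpenAI's strict json enforcement
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["explanation", "content"],
            "additionalProperties": False,
        },
    },
}
```

Passed as `tools=[write_file_tool]` to the chat completions API, the model's reply then arrives as the json arguments of a tool call rather than as plain text.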
-Both of these formats avoid actually *editing* source files, to keep things as
+Both of these formats avoid actually *editing* source files, to keep
+the task as
 simple as possible.
-This makes the task much easier, since the LLM can emit the whole source file intact.
-LLMs find it much more challenging to correctly formulate instructions to edit
+The LLM can emit the whole source file intact,
+which is much easier
+than correctly formulating
+instructions to edit
 portions of a file.
-We are simply testing the effects of json-wrapping on the LLMs ability to solve coding tasks.
+We are simply testing the effects of json-wrapping on the LLM's ability to write code to solve a task.
 
 ## Results