From b2211c4a58274a75e3c5921d8bb55510b5f9cca4 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Wed, 14 Aug 2024 16:41:08 -0700 Subject: [PATCH 01/34] initial --- .../website/_posts/2024-08-14-code-in-json.md | 127 ++++++++++++++++++ 1 file changed, 127 insertions(+) create mode 100644 aider/website/_posts/2024-08-14-code-in-json.md diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md new file mode 100644 index 000000000..7fc8919db --- /dev/null +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -0,0 +1,127 @@ +--- +title: LLMs are bad at returning code in json +excerpt: LLMs write worse code if you ask them to return the code wrapped in json via a tool/function call. +highlight_image: /assets/code-in-json.jpg +draft: true +nav_exclude: true +--- +{% if page.date %} +

{{ page.date | date: "%B %d, %Y" }}

+{% endif %} + +# LLMs are bad at returning code in json + + +A lot of people wonder why aider doesn't have LLMs use tools or function calls to +specify code edits. +Instead, aider asks LLMs to return code edits in plain text, like this: + +```` +greeting.py +```python +<<<<<<< SEARCH +def greeting(): + print("Hello") +======= +def greeting(): + print("Goodbye") +>>>>>>> REPLACE +``` +```` + +People expect that it would be easier and more reliable +for aider to parse a nicely formatted json +response more like this: + +``` +{ + "filename": "greeting.py", + "start_line": 6, + "end_line": 7, + "new_content": "def greeting():\n print(\"Goodbye\")\n" +} +``` + +This seems even more tempting as LLMs get better tooling for reliably generating +valid json, or even enforcing that it meets a specific schema. +For example, OpenAI recently announced +[strict enforcement of json responses](). + +The problem is that LLMs are bad a writing code when you ask them to wrap it +into a json container. +The json tooling around the LLM helps make sure it's valid json, +which does solve an important problem. +LLMs used to frequently produce invalid json, so that's a big step forward. + +The problem remains, LLMs write worse code when they're asked to +emit it wrapped in json. +In some sense this shouldn't be surprising. +Just look at the very simple +json example above, with the escaped +quotes `\"` quotes +newlines `\n` +mixed into the code. +Coding is complicated enough without having to escape all the special characters too. + +If I asked you to write me a program, would you do a better job +typing it into a text file or hand typing it as a properly escaped json string? + +## Quantifying the benefits of plain text + + +Previous [benchmark results](/2023/07/02/benchmarks.html) +showed +the superiority of plain text coding compared to json-wrapped function calls, +but they were done over a year ago. 
+OpenAI's newly announced support for "strict" json seemed like a good reason to +investigate whether the newest models are still handicapped by json-wrapping code. + +To find out, I benchmarked 3 of the strongest code editing models: + +- gpt-4o-2024-08-06 +- claude-3-5-sonnet-20240620 +- deepseek-coder (V2 0724) + +Each model was given one try to solve +[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark). +This is the standard aider "code editing" benchmark, except restricted to a single attempt. + +Each model ran through the benchmark with two strategies for returning code: + +- **Markdown** -- where the model simply returns the whole source code file in standard markdown triple-backtick fences. +- **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json. + +The markdown strategy would return a program like this: + +```` +Here is the program you asked for which prints "Hello, world!": + +greeting.py +``` +def greeting(): + print("Hello") +``` +```` + +The tool strategy requires the LLM to call the `write_file` function with +two parameters, like this: + +``` +{ + "explanation": "Here is the program you asked for which prints \"Hello, world!\"", + "content": "def greeting():\n print(\"Hello\")\n" +} +``` + +Both of these formats avoid actually *editing* source files, to keep things as +simple as possible. +This makes the task much easier, since the LLM can emit the whole source file intact. +LLMs find it much more challenging to correctly formulate instructions to edit +portions of a file. + +We are simply testing the effects of json-wrapping on the LLMs ability to solve coding tasks. + +## Results + +All 3 models did significantly worse on the benchmark when asked to +return json-wrapped code in a tool function call. 
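A minimal sketch, not taken from the aider benchmark harness, of the escaping overhead that the post in PATCH 01 describes. It builds the same small program both ways: verbatim inside a markdown fence, and as the `content` argument of the `write_file` tool call. The `write_file` parameters (`explanation`, `content`) come from the post; the helper `count_escapes` and the variable names are hypothetical and exist only for illustration.

```python
import json

# The program from the post's example, as the model would write it in plain text.
source = 'def greeting():\n    print("Hello")\n'

# Markdown strategy: the code is emitted verbatim between fences.
fence = "`" * 3  # built this way to avoid nesting literal fences in this example
markdown_reply = f"greeting.py\n{fence}\n{source}{fence}\n"

# Tool call strategy: the same code must be serialized as a JSON string value,
# so every quote and newline in the code turns into an escape sequence.
tool_args = json.dumps(
    {
        "explanation": 'Here is the program you asked for which prints "Hello"',
        "content": source,
    }
)


def count_escapes(text):
    """Count the backslashes a writer has to produce to emit this text."""
    return text.count("\\")


print(markdown_reply)
print(tool_args)
print("escapes in markdown reply:", count_escapes(markdown_reply))
print("escapes in tool call args:", count_escapes(tool_args))

# Decoding the JSON recovers the original code exactly; the cost is only
# in what the model has to generate, not in what the client receives.
assert json.loads(tool_args)["content"] == source
```

The two replies carry identical code, but only the json-wrapped one forces the writer to escape characters, and the count grows with every additional quote or newline in the program. This illustrates the mechanism the post proposes; it does not by itself measure any effect on code quality.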
From 205a503d64ee655f8d284406beb655567fc97d2e Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Wed, 14 Aug 2024 16:41:22 -0700 Subject: [PATCH 02/34] init --- aider/website/_data/code-in-json.yml | 154 +++++++++++++++++++++++++++ 1 file changed, 154 insertions(+) create mode 100644 aider/website/_data/code-in-json.yml diff --git a/aider/website/_data/code-in-json.yml b/aider/website/_data/code-in-json.yml new file mode 100644 index 000000000..c4ed8d073 --- /dev/null +++ b/aider/website/_data/code-in-json.yml @@ -0,0 +1,154 @@ +- dirname: 2024-08-14-18-38-25--json-gpt-4o-2024-08-06-non-strict-func + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: Tool call + commit_hash: 2eb1946-dirty + pass_rate_1: 54.1 + percent_cases_well_formed: 100.0 + error_outputs: 7 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 2 + lazy_comments: 0 + syntax_errors: 2 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 4 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 11.5 + total_cost: 1.3819 + +- dirname: 2024-08-14-18-32-02--json-gpt-4o-2024-08-06-strict-func + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: Tool call (strict) + commit_hash: 2eb1946 + pass_rate_1: 56.4 + percent_cases_well_formed: 100.0 + error_outputs: 1 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 7 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 4 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 12.7 + total_cost: 1.3652 + +- dirname: 2024-08-14-18-26-18--json-gpt-4o-2024-08-06-whole + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: Markdown + commit_hash: 94a2601-dirty + pass_rate_1: 62.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 
+ syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 3 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 6.8 + total_cost: 1.2717 + +- dirname: 2024-08-14-20-19-23--json-sonnet-non-strict-func + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: Tool call + commit_hash: e2f14a2 + pass_rate_1: 52.6 + percent_cases_well_formed: 100.0 + error_outputs: 1 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 1 + lazy_comments: 0 + syntax_errors: 1 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 18.9 + total_cost: 2.6341 + +- dirname: 2024-08-14-20-15-19--json-sonnet-whole + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: Markdown + commit_hash: e2f14a2 + pass_rate_1: 58.6 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 19.7 + total_cost: 2.5335 + +- dirname: 2024-08-14-21-20-46--json-deepseek-non-strict-func + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: Tool call + commit_hash: e2f14a2 + pass_rate_1: 54.1 + percent_cases_well_formed: 100.0 + error_outputs: 9 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 5 + lazy_comments: 0 + syntax_errors: 2 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 17.4 + total_cost: 0.0332 + +- dirname: 
2024-08-14-21-23-27--json-deepseek-whole + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: Markdown + commit_hash: e2f14a2 + pass_rate_1: 61.7 + percent_cases_well_formed: 100.0 + error_outputs: 1 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 1 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 23.0 + total_cost: 0.0439 + From 957524680a2e6c7044009067a6561aaa87e78704 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Wed, 14 Aug 2024 16:44:43 -0700 Subject: [PATCH 03/34] feat: Add bar graph to plot pass_rate_1 by model and edit_format --- .../website/_posts/2024-08-14-code-in-json.md | 68 +++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 7fc8919db..2cb18662b 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -9,6 +9,74 @@ nav_exclude: true {% endif %} + + + + + # LLMs are bad at returning code in json From 7310f0928f919c493d4327b86f53dd4e2af960e8 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Wed, 14 Aug 2024 16:46:00 -0700 Subject: [PATCH 04/34] feat: Fetch data from YAML file for chart --- .../website/_posts/2024-08-14-code-in-json.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 2cb18662b..1e7e729c6 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -16,25 +16,25 @@ nav_exclude: true document.addEventListener('DOMContentLoaded', function () { var ctx = document.getElementById('passRateChart').getContext('2d'); 
+ var yamlData = {{ site.data.code-in-json | jsonify }}; + + var models = [...new Set(yamlData.map(item => item.model))]; + var editFormats = [...new Set(yamlData.map(item => item.edit_format))]; + + var datasets = editFormats.map(format => ({ + label: format, + data: models.map(model => { + var item = yamlData.find(d => d.model === model && d.edit_format === format); + return item ? item.pass_rate_1 : null; + }), + backgroundColor: format === 'Markdown' ? 'rgba(54, 162, 235, 0.8)' : + format === 'Tool call' ? 'rgba(255, 99, 132, 0.8)' : + 'rgba(75, 192, 192, 0.8)', + })); + var data = { - labels: ['gpt-4o-2024-08-06', 'claude-3.5-sonnet', 'deepseek-coder'], - datasets: [ - { - label: 'Markdown', - data: [62.4, 58.6, 61.7], - backgroundColor: 'rgba(54, 162, 235, 0.8)', - }, - { - label: 'Tool call', - data: [54.1, 52.6, 54.1], - backgroundColor: 'rgba(255, 99, 132, 0.8)', - }, - { - label: 'Tool call (strict)', - data: [56.4, null, null], - backgroundColor: 'rgba(75, 192, 192, 0.8)', - } - ] + labels: models, + datasets: datasets }; var config = { From b3ed2c8a48a97e17ddbba6578ab519e983a81d24 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Wed, 14 Aug 2024 16:50:14 -0700 Subject: [PATCH 05/34] copy --- aider/website/_data/code-in-json.yml | 114 +++++++++--------- .../website/_posts/2024-08-14-code-in-json.md | 9 +- 2 files changed, 62 insertions(+), 61 deletions(-) diff --git a/aider/website/_data/code-in-json.yml b/aider/website/_data/code-in-json.yml index c4ed8d073..64c42a2d5 100644 --- a/aider/website/_data/code-in-json.yml +++ b/aider/website/_data/code-in-json.yml @@ -1,3 +1,25 @@ +- dirname: 2024-08-14-18-26-18--json-gpt-4o-2024-08-06-whole + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: Markdown + commit_hash: 94a2601-dirty + pass_rate_1: 62.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + 
exhausted_context_windows: 0 + test_timeouts: 3 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 6.8 + total_cost: 1.2717 + - dirname: 2024-08-14-18-38-25--json-gpt-4o-2024-08-06-non-strict-func test_cases: 133 model: gpt-4o-2024-08-06 @@ -42,53 +64,9 @@ seconds_per_case: 12.7 total_cost: 1.3652 -- dirname: 2024-08-14-18-26-18--json-gpt-4o-2024-08-06-whole - test_cases: 133 - model: gpt-4o-2024-08-06 - edit_format: Markdown - commit_hash: 94a2601-dirty - pass_rate_1: 62.4 - percent_cases_well_formed: 100.0 - error_outputs: 0 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 0 - lazy_comments: 0 - syntax_errors: 0 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 3 - command: aider --model gpt-4o-2024-08-06 - date: 2024-08-14 - versions: 0.50.2-dev - seconds_per_case: 6.8 - total_cost: 1.2717 - -- dirname: 2024-08-14-20-19-23--json-sonnet-non-strict-func - test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: Tool call - commit_hash: e2f14a2 - pass_rate_1: 52.6 - percent_cases_well_formed: 100.0 - error_outputs: 1 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 1 - lazy_comments: 0 - syntax_errors: 1 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet - date: 2024-08-14 - versions: 0.50.2-dev - seconds_per_case: 18.9 - total_cost: 2.6341 - - dirname: 2024-08-14-20-15-19--json-sonnet-whole test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet + model: claude-3.5-sonnet edit_format: Markdown commit_hash: e2f14a2 pass_rate_1: 58.6 @@ -102,37 +80,37 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-14 versions: 0.50.2-dev seconds_per_case: 19.7 total_cost: 2.5335 -- dirname: 
2024-08-14-21-20-46--json-deepseek-non-strict-func +- dirname: 2024-08-14-20-19-23--json-sonnet-non-strict-func test_cases: 133 - model: openrouter/deepseek/deepseek-coder + model: claude-3.5-sonnet edit_format: Tool call commit_hash: e2f14a2 - pass_rate_1: 54.1 + pass_rate_1: 52.6 percent_cases_well_formed: 100.0 - error_outputs: 9 + error_outputs: 1 num_malformed_responses: 0 num_with_malformed_responses: 0 - user_asks: 5 + user_asks: 1 lazy_comments: 0 - syntax_errors: 2 + syntax_errors: 1 indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model claude-3.5-sonnet date: 2024-08-14 versions: 0.50.2-dev - seconds_per_case: 17.4 - total_cost: 0.0332 + seconds_per_case: 18.9 + total_cost: 2.6341 - dirname: 2024-08-14-21-23-27--json-deepseek-whole test_cases: 133 - model: openrouter/deepseek/deepseek-coder + model: deepseek-coder edit_format: Markdown commit_hash: e2f14a2 pass_rate_1: 61.7 @@ -146,9 +124,31 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-14 versions: 0.50.2-dev seconds_per_case: 23.0 total_cost: 0.0439 +- dirname: 2024-08-14-21-20-46--json-deepseek-non-strict-func + test_cases: 133 + model: deepseek-coder + edit_format: Tool call + commit_hash: e2f14a2 + pass_rate_1: 54.1 + percent_cases_well_formed: 100.0 + error_outputs: 9 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 5 + lazy_comments: 0 + syntax_errors: 2 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model deepseek-coder + date: 2024-08-14 + versions: 0.50.2-dev + seconds_per_case: 17.4 + total_cost: 0.0332 + diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 1e7e729c6..747eaa0cd 100644 --- 
a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -9,6 +9,9 @@ nav_exclude: true {% endif %} +# LLMs are bad at returning code in json + + @@ -55,13 +58,13 @@ document.addEventListener('DOMContentLoaded', function () { display: true, text: 'Pass Rate (%)' }, - max: 100 + max: 70 } }, plugins: { title: { display: true, - text: 'Pass Rate by Model and Edit Format', + text: 'Pass rate by model and code return strategy', font: { size: 16 } @@ -77,8 +80,6 @@ document.addEventListener('DOMContentLoaded', function () { }); -# LLMs are bad at returning code in json - A lot of people wonder why aider doesn't have LLMs use tools or function calls to specify code edits. From a951a2afc9ea535f9c021a26a497156421edb97a Mon Sep 17 00:00:00 2001 From: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com> Date: Wed, 14 Aug 2024 18:56:01 -0700 Subject: [PATCH 06/34] Update 2024-08-14-code-in-json.md --- .../website/_posts/2024-08-14-code-in-json.md | 48 +++++++++++-------- 1 file changed, 29 insertions(+), 19 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 747eaa0cd..119059015 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -116,14 +116,13 @@ valid json, or even enforcing that it meets a specific schema. For example, OpenAI recently announced [strict enforcement of json responses](). -The problem is that LLMs are bad a writing code when you ask them to wrap it -into a json container. -The json tooling around the LLM helps make sure it's valid json, -which does solve an important problem. -LLMs used to frequently produce invalid json, so that's a big step forward. - -The problem remains, LLMs write worse code when they're asked to +But it's not sufficient to just produce +valid json, it also +has to contain quality code. 
+Unfortunately, +LLMs write worse code when they're asked to emit it wrapped in json. + In some sense this shouldn't be surprising. Just look at the very simple json example above, with the escaped @@ -140,12 +139,17 @@ typing it into a text file or hand typing it as a properly escaped json string? Previous [benchmark results](/2023/07/02/benchmarks.html) showed -the superiority of plain text coding compared to json-wrapped function calls, -but they were done over a year ago. +the superiority of returning code +as plain text coding compared to json-wrapped function calls. +But those results were obtained +over a year ago, against far less +capable models. OpenAI's newly announced support for "strict" json seemed like a good reason to investigate whether the newest models are still handicapped by json-wrapping code. -To find out, I benchmarked 3 of the strongest code editing models: +The graph above shows benchmark +results from +3 of the strongest code editing models: - gpt-4o-2024-08-06 - claude-3-5-sonnet-20240620 @@ -155,15 +159,18 @@ Each model was given one try to solve [133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark). This is the standard aider "code editing" benchmark, except restricted to a single attempt. -Each model ran through the benchmark with two strategies for returning code: +Each model was assessed by the benchmark with two +different strategies for returning code: -- **Markdown** -- where the model simply returns the whole source code file in standard markdown triple-backtick fences. +- **Markdown** -- where the model simply returned the whole source code file in standard markdown triple-backtick fences. - **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json. -The markdown strategy would return a program like this: +The markdown strategy is the same as +aider's "whole" edit format. 
+It asks the LLM to return a program like this: ```` -Here is the program you asked for which prints "Hello, world!": +Here is the program you asked for which prints "Hello": greeting.py ``` @@ -177,18 +184,21 @@ two parameters, like this: ``` { - "explanation": "Here is the program you asked for which prints \"Hello, world!\"", + "explanation": "Here is the program you asked for which prints \"Hello\"", "content": "def greeting():\n print(\"Hello\")\n" } ``` -Both of these formats avoid actually *editing* source files, to keep things as +Both of these formats avoid actually *editing* source files, to keep +the task as simple as possible. -This makes the task much easier, since the LLM can emit the whole source file intact. -LLMs find it much more challenging to correctly formulate instructions to edit +The LLM can emit the whole source file intact, +which is much easier +than correctly formulating +instructions to edit portions of a file. -We are simply testing the effects of json-wrapping on the LLMs ability to solve coding tasks. +We are simply testing the effects of json-wrapping on the LLMs ability to write code to solve a task. ## Results From 9ab185a88fa1a5c52a9497e8b3767c99a13fd700 Mon Sep 17 00:00:00 2001 From: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com> Date: Wed, 14 Aug 2024 18:57:18 -0700 Subject: [PATCH 07/34] Update 2024-08-14-code-in-json.md --- aider/website/_posts/2024-08-14-code-in-json.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 119059015..71f789ed1 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -1,6 +1,6 @@ --- title: LLMs are bad at returning code in json -excerpt: LLMs write worse code if you ask them to return the code wrapped in json via a tool/function call. 
+excerpt: LLMs write worse code if you ask them to return the code wrapped in json (via a tool or function call). highlight_image: /assets/code-in-json.jpg draft: true nav_exclude: true From d0e716ea7da300499e31ae9671e3eaaf425de6b1 Mon Sep 17 00:00:00 2001 From: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com> Date: Wed, 14 Aug 2024 19:02:23 -0700 Subject: [PATCH 08/34] Update 2024-08-14-code-in-json.md --- aider/website/_posts/2024-08-14-code-in-json.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 71f789ed1..9721da004 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -100,7 +100,7 @@ def greeting(): People expect that it would be easier and more reliable for aider to parse a nicely formatted json -response more like this: +response, like this: ``` { @@ -111,7 +111,8 @@ response more like this: } ``` -This seems even more tempting as LLMs get better tooling for reliably generating +This seems even more tempting as LLMs +get better tooling for reliably generating valid json, or even enforcing that it meets a specific schema. For example, OpenAI recently announced [strict enforcement of json responses](). @@ -126,13 +127,16 @@ emit it wrapped in json. In some sense this shouldn't be surprising. Just look at the very simple json example above, with the escaped -quotes `\"` quotes +quotes `\"` and newlines `\n` mixed into the code. Coding is complicated enough without having to escape all the special characters too. -If I asked you to write me a program, would you do a better job -typing it into a text file or hand typing it as a properly escaped json string? +If you tried to write a program, +would you do a better job +typing it normally +or as a properly escaped +json string? 
## Quantifying the benefits of plain text From 0a2d75b966a5f81d0309cc40990d4a55537e276f Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Wed, 14 Aug 2024 20:05:23 -0700 Subject: [PATCH 09/34] fix: Apply consistent color and striped pattern to "Tool call (strict)" --- aider/website/_posts/2024-08-14-code-in-json.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 9721da004..a2f52b0c2 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -31,8 +31,11 @@ document.addEventListener('DOMContentLoaded', function () { return item ? item.pass_rate_1 : null; }), backgroundColor: format === 'Markdown' ? 'rgba(54, 162, 235, 0.8)' : - format === 'Tool call' ? 'rgba(255, 99, 132, 0.8)' : + format.startsWith('Tool call') ? 'rgba(255, 99, 132, 0.8)' : 'rgba(75, 192, 192, 0.8)', + borderColor: format === 'Tool call (strict)' ? 'rgba(255, 255, 255, 0.8)' : null, + borderWidth: format === 'Tool call (strict)' ? 2 : 0, + borderDash: format === 'Tool call (strict)' ? [5, 5] : null, })); var data = { From a47a5c91794a2ebe1621acf341396d2b647d35aa Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Wed, 14 Aug 2024 20:07:09 -0700 Subject: [PATCH 10/34] fix: update code-in-json.md post with improved styling for code blocks --- aider/website/_posts/2024-08-14-code-in-json.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index a2f52b0c2..ed25f1056 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -33,9 +33,6 @@ document.addEventListener('DOMContentLoaded', function () { backgroundColor: format === 'Markdown' ? 'rgba(54, 162, 235, 0.8)' : format.startsWith('Tool call') ? 
'rgba(255, 99, 132, 0.8)' : 'rgba(75, 192, 192, 0.8)', - borderColor: format === 'Tool call (strict)' ? 'rgba(255, 255, 255, 0.8)' : null, - borderWidth: format === 'Tool call (strict)' ? 2 : 0, - borderDash: format === 'Tool call (strict)' ? [5, 5] : null, })); var data = { From 9b2f317ba362f8bdb52b47540e169efc1892e874 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Wed, 14 Aug 2024 20:07:20 -0700 Subject: [PATCH 11/34] feat: Add function to create striped canvas pattern --- .../website/_posts/2024-08-14-code-in-json.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index ed25f1056..289fdc481 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -76,6 +76,28 @@ document.addEventListener('DOMContentLoaded', function () { } }; + function createStripedCanvas(isStrict) { + const patternCanvas = document.createElement('canvas'); + const patternContext = patternCanvas.getContext('2d'); + const size = 10; + patternCanvas.width = size; + patternCanvas.height = size; + + patternContext.fillStyle = 'rgba(255, 99, 132, 0.8)'; + patternContext.fillRect(0, 0, size, size); + + if (isStrict) { + patternContext.strokeStyle = 'rgba(255, 255, 255, 0.8)'; + patternContext.lineWidth = 2; + patternContext.beginPath(); + patternContext.moveTo(0, 0); + patternContext.lineTo(size, size); + patternContext.stroke(); + } + + return patternCanvas; + } + new Chart(ctx, config); }); From 23f89f1d29d4525c89da37273034863e0364f2a8 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Wed, 14 Aug 2024 20:07:21 -0700 Subject: [PATCH 12/34] feat: Add striped pattern for "Tool call (strict)" format --- aider/website/_posts/2024-08-14-code-in-json.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md 
b/aider/website/_posts/2024-08-14-code-in-json.md index 289fdc481..6dd2d3e4b 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -30,9 +30,18 @@ document.addEventListener('DOMContentLoaded', function () { var item = yamlData.find(d => d.model === model && d.edit_format === format); return item ? item.pass_rate_1 : null; }), - backgroundColor: format === 'Markdown' ? 'rgba(54, 162, 235, 0.8)' : - format.startsWith('Tool call') ? 'rgba(255, 99, 132, 0.8)' : - 'rgba(75, 192, 192, 0.8)', + backgroundColor: function(context) { + const format = context.dataset.label; + if (format === 'Markdown') { + return 'rgba(54, 162, 235, 0.8)'; + } else if (format.startsWith('Tool call')) { + const ctx = context.chart.ctx; + const gradient = ctx.createPattern(createStripedCanvas(format === 'Tool call (strict)'), 'repeat'); + return gradient; + } else { + return 'rgba(75, 192, 192, 0.8)'; + } + }, })); var data = { From 6ef2b8c0fa1c6ec16f9dbe9a24502a0fadfe8bc0 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 06:05:38 -0700 Subject: [PATCH 13/34] copy --- aider/website/_posts/2024-08-14-code-in-json.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 6dd2d3e4b..f61643aa2 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -97,7 +97,7 @@ document.addEventListener('DOMContentLoaded', function () { if (isStrict) { patternContext.strokeStyle = 'rgba(255, 255, 255, 0.8)'; - patternContext.lineWidth = 2; + patternContext.lineWidth = 0.75; patternContext.beginPath(); patternContext.moveTo(0, 0); patternContext.lineTo(size, size); @@ -112,6 +112,21 @@ document.addEventListener('DOMContentLoaded', function () { +## Abstract + +The newest LLMs have explicit tooling and +support for returning properly formatted 
json responses. +While it is tempting to have LLMs use json tool or function calls to +return code or code edits, this is unwise. +LLMs write worse code when asked to wrap it in json, harming their ability +to correctly solve coding tasks. +Returning code as plain (markdown) text results in 6% higher scores +on a variant of the aider code editing benchmark. +Even OpenAI's newest gpt-4o-2024-08-06 with "strict" json support +suffers from this code-in-json handicap. + +## Introduction + A lot of people wonder why aider doesn't have LLMs use tools or function calls to specify code edits. Instead, aider asks LLMs to return code edits in plain text, like this: From ed6ebfbdb69269b04db3607a1ddfb685b3c6cf3e Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 08:08:56 -0700 Subject: [PATCH 14/34] fix: Update post on code in JSON --- .../website/_posts/2024-08-14-code-in-json.md | 89 ++++++++++--------- 1 file changed, 49 insertions(+), 40 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index f61643aa2..572aaf3be 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -114,22 +114,26 @@ document.addEventListener('DOMContentLoaded', function () { ## Abstract -The newest LLMs have explicit tooling and -support for returning properly formatted json responses. -While it is tempting to have LLMs use json tool or function calls to -return code or code edits, this is unwise. +The newest LLMs have support for returning properly formatted json responses, +making it easy for client applications to parse complex responses. +This makes it tempting for AI coding applications to +use tool function calls or other structured reply formats to +receive code from LLMs. +Unfortunately, LLMs write worse code when asked to wrap it in json, harming their ability to correctly solve coding tasks. 
-Returning code as plain (markdown) text results in 6% higher scores +Returning code as plain (markdown) text results in an average of 6% higher scores on a variant of the aider code editing benchmark. -Even OpenAI's newest gpt-4o-2024-08-06 with "strict" json support +This holds true across many top coding LLMs, +and even OpenAI's newest gpt-4o-2024-08-06 with "strict" json support suffers from this code-in-json handicap. ## Introduction -A lot of people wonder why aider doesn't have LLMs use tools or function calls to +A lot of people wonder why aider doesn't tell LLMs to +use tools or function calls to specify code edits. -Instead, aider asks LLMs to return code edits in plain text, like this: +Instead, aider asks for code edits in plain text, like this: ```` greeting.py @@ -144,31 +148,30 @@ def greeting(): ``` ```` -People expect that it would be easier and more reliable -for aider to parse a nicely formatted json -response, like this: +People expect that it would be easier and more reliable to use tool calls, +and parse a nicely formatted json +response: ``` { "filename": "greeting.py", - "start_line": 6, - "end_line": 7, - "new_content": "def greeting():\n print(\"Goodbye\")\n" + "search": "def greeting():\n print(\"Hello\")\n", + "replace": "def greeting():\n print(\"Goodbye\")\n" } ``` -This seems even more tempting as LLMs -get better tooling for reliably generating -valid json, or even enforcing that it meets a specific schema. -For example, OpenAI recently announced -[strict enforcement of json responses](). +This has become even more tempting as LLM providers +continue to improve their tooling for reliably generating +valid json. +For example, OpenAI recently announced the ability to +[strictly enforce that json responses will be syntactically correct +and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/). -But it's not sufficient to just produce -valid json, it also -has to contain quality code.
-Unfortunately, +But producing valid (schema compliant) json is not sufficient for this use case. +The json also has to contain valid, high quality code. +And unfortunately, LLMs write worse code when they're asked to -emit it wrapped in json. +wrap it in json. In some sense this shouldn't be surprising. Just look at the very simple @@ -176,24 +179,26 @@ json example above, with the escaped quotes `\"` and newlines `\n` mixed into the code. -Coding is complicated enough without having to escape all the special characters too. +Imagine if the code itself contained json or other quoted strings, +with their +own escape sequences. If you tried to write a program, would you do a better job -typing it normally +typing it out normally or as a properly escaped json string? + ## Quantifying the benefits of plain text - -Previous [benchmark results](/2023/07/02/benchmarks.html) +Previous [aider benchmark results](/2023/07/02/benchmarks.html) showed the superiority of returning code -as plain text coding compared to json-wrapped function calls. -But those results were obtained +as plain text coding compared to json-wrapped function calls. +Those results were obtained over a year ago, against far less -capable models. +capable models. OpenAI's newly announced support for "strict" json seemed like a good reason to investigate whether the newest models are still handicapped by json-wrapping code. @@ -207,17 +212,18 @@ results from Each model was given one try to solve [133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark). -This is the standard aider "code editing" benchmark, except restricted to a single attempt. +This is the standard aider "code editing" benchmark, but restricted to a single attempt +without a second try to "fix" any errors. 
-Each model was assessed by the benchmark with two +Each model was assessed by the benchmark using two different strategies for returning code: -- **Markdown** -- where the model simply returned the whole source code file in standard markdown triple-backtick fences. -- **Tool call** -- where the model is told to use a function to return the whole source code file. This requires the LLM to wrap the code in json. +- **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences. +- **Tool call** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in json. The markdown strategy is the same as -aider's "whole" edit format. -It asks the LLM to return a program like this: +aider's "whole" edit format, where the +LLM would return a source file like this: ```` Here is the program you asked for which prints "Hello": @@ -230,7 +236,9 @@ def greeting(): ```` The tool strategy requires the LLM to call the `write_file` function with -two parameters, like this: +two parameters, as shown below. +For maximum simplicity, the LLM didn't even have to specify the filename, +since the benchmark operates only on a single source file. ``` { @@ -242,13 +250,14 @@ two parameters, like this: Both of these formats avoid actually *editing* source files, to keep the task as simple as possible. -The LLM can emit the whole source file intact, +The LLM is able to emit the whole source file intact, which is much easier than correctly formulating instructions to edit portions of a file. -We are simply testing the effects of json-wrapping on the LLMs ability to write code to solve a task. +This experimental setup is designed to highlight +the effects of json-wrapping on the LLM's ability to write code to solve a task.
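As an illustrative aside (not part of the patch or the benchmark harness), the escaping burden described in the post can be made concrete with a short Python sketch. The `content` key and the helper names `json_wrap` and `escape_overhead` are hypothetical, chosen only for this example; they are not taken from aider's actual tool schema.

```python
import json

# The two-line function used in the post's examples.
source = 'def greeting():\n    print("Hello")\n'


def json_wrap(code: str) -> str:
    """Wrap code the way a tool call argument would carry it.

    The "content" key is a hypothetical parameter name for illustration.
    """
    return json.dumps({"content": code})


def escape_overhead(code: str) -> int:
    """Count the extra characters JSON string escaping adds to the code,
    ignoring the two surrounding double quotes."""
    return len(json.dumps(code)) - 2 - len(code)


if __name__ == "__main__":
    print("Plain text form:")
    print(source)
    print("JSON-wrapped form:")
    print(json_wrap(source))
    print("Extra escape characters:", escape_overhead(source))
```

Even this tiny function needs its two newlines and two double quotes escaped once it is wrapped. Code that itself contains string escapes or embedded json would presumably need proportionally more.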
## Results From 341c08be3ecd58f1371b6e58eee8dbad57b57910 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Thu, 15 Aug 2024 08:08:58 -0700 Subject: [PATCH 15/34] feat: average datapoints for each model/edit_format --- aider/website/_posts/2024-08-14-code-in-json.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 572aaf3be..bde150140 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -27,8 +27,10 @@ document.addEventListener('DOMContentLoaded', function () { var datasets = editFormats.map(format => ({ label: format, data: models.map(model => { - var item = yamlData.find(d => d.model === model && d.edit_format === format); - return item ? item.pass_rate_1 : null; + var items = yamlData.filter(d => d.model === model && d.edit_format === format); + if (items.length === 0) return null; + var average = items.reduce((sum, item) => sum + item.pass_rate_1, 0) / items.length; + return parseFloat(average.toFixed(1)); }), backgroundColor: function(context) { const format = context.dataset.label; From 9982cda5085dd450592486fd067943f3b984707a Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 08:11:54 -0700 Subject: [PATCH 16/34] 5 benchmark runs --- aider/website/_data/code-in-json.yml | 848 ++++++++++++++++++++++++--- 1 file changed, 767 insertions(+), 81 deletions(-) diff --git a/aider/website/_data/code-in-json.yml b/aider/website/_data/code-in-json.yml index 64c42a2d5..0f2bbcbed 100644 --- a/aider/website/_data/code-in-json.yml +++ b/aider/website/_data/code-in-json.yml @@ -1,9 +1,9 @@ -- dirname: 2024-08-14-18-26-18--json-gpt-4o-2024-08-06-whole +- dirname: 2024-08-15-13-17-11--json-no-lint-gpt-4o-2024-08-06-whole test_cases: 133 - model: gpt-4o-2024-08-06 - edit_format: Markdown - commit_hash: 94a2601-dirty - pass_rate_1: 62.4 + model: 
openai/gpt-4o-2024-08-06 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.2 percent_cases_well_formed: 100.0 error_outputs: 0 num_malformed_responses: 0 @@ -13,62 +13,395 @@ syntax_errors: 0 indentation_errors: 0 exhausted_context_windows: 0 - test_timeouts: 3 - command: aider --model gpt-4o-2024-08-06 - date: 2024-08-14 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 versions: 0.50.2-dev - seconds_per_case: 6.8 - total_cost: 1.2717 - -- dirname: 2024-08-14-18-38-25--json-gpt-4o-2024-08-06-non-strict-func + seconds_per_case: 4.3 + total_cost: 0.7965 +- dirname: 2024-08-15-13-18-36--json-no-lint-gpt-4o-2024-08-06-func test_cases: 133 - model: gpt-4o-2024-08-06 - edit_format: Tool call - commit_hash: 2eb1946-dirty - pass_rate_1: 54.1 + model: openai/gpt-4o-2024-08-06 + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 57.9 percent_cases_well_formed: 100.0 - error_outputs: 7 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 1 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 5.7 + total_cost: 0.8417 +- dirname: 2024-08-15-13-20-11--json-no-lint-gpt-4o-2024-05-13-whole + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 56.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 8.0 + total_cost: 1.5034 +- dirname: 2024-08-15-13-21-55--json-no-lint-gpt-4o-2024-05-13-func + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: func + 
commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 2 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 1 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 7.1 + total_cost: 1.2285 +- dirname: 2024-08-15-13-23-33--json-no-lint-claude-3.5-sonnet-whole + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 10.5 + total_cost: 1.6714 +- dirname: 2024-08-15-13-24-56--json-no-lint-claude-3.5-sonnet-func + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 53.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 9.7 + total_cost: 1.5980 +- dirname: 2024-08-15-13-26-38--json-no-lint-deepseek-coder-whole + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 2 num_malformed_responses: 0 num_with_malformed_responses: 0 user_asks: 2 lazy_comments: 0 - syntax_errors: 2 + syntax_errors: 0 
indentation_errors: 0 exhausted_context_windows: 0 - test_timeouts: 4 - command: aider --model gpt-4o-2024-08-06 - date: 2024-08-14 + test_timeouts: 0 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 versions: 0.50.2-dev - seconds_per_case: 11.5 - total_cost: 1.3819 - -- dirname: 2024-08-14-18-32-02--json-gpt-4o-2024-08-06-strict-func + seconds_per_case: 27.9 + total_cost: 0.0438 +- dirname: 2024-08-15-13-29-55--json-no-lint-deepseek-coder-func test_cases: 133 - model: gpt-4o-2024-08-06 - edit_format: Tool call (strict) - commit_hash: 2eb1946 + model: openrouter/deepseek/deepseek-coder + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 49.6 + percent_cases_well_formed: 100.0 + error_outputs: 3 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 4 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 20.5 + total_cost: 0.0329 +- dirname: 2024-08-15-13-50-03--json-no-lint-gpt-4o-2024-08-06-whole-2 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 61.7 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 4.2 + total_cost: 0.7946 +- dirname: 2024-08-15-13-51-36--json-no-lint-gpt-4o-2024-08-06-func-2 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: func + commit_hash: bac04a2 pass_rate_1: 56.4 percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 1 
+ indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.4 + total_cost: 0.8390 +- dirname: 2024-08-15-13-53-23--json-no-lint-gpt-4o-2024-05-13-whole-2 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 7.4 + total_cost: 1.4996 +- dirname: 2024-08-15-13-54-53--json-no-lint-gpt-4o-2024-05-13-func-2 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 1 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 7.7 + total_cost: 1.2210 +- dirname: 2024-08-15-13-56-21--json-no-lint-claude-3.5-sonnet-whole-2 + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.9 + percent_cases_well_formed: 100.0 error_outputs: 1 num_malformed_responses: 0 num_with_malformed_responses: 0 user_asks: 0 lazy_comments: 0 - syntax_errors: 7 + syntax_errors: 0 indentation_errors: 0 exhausted_context_windows: 0 - test_timeouts: 4 - command: aider --model gpt-4o-2024-08-06 - date: 2024-08-14 + test_timeouts: 0 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 versions: 0.50.2-dev - 
seconds_per_case: 12.7 - total_cost: 1.3652 - -- dirname: 2024-08-14-20-15-19--json-sonnet-whole + seconds_per_case: 16.5 + total_cost: 1.6556 +- dirname: 2024-08-15-14-02-15--json-no-lint-claude-3.5-sonnet-func-2 test_cases: 133 - model: claude-3.5-sonnet - edit_format: Markdown - commit_hash: e2f14a2 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 51.9 + percent_cases_well_formed: 100.0 + error_outputs: 1 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 14.3 + total_cost: 1.5835 +- dirname: 2024-08-15-14-06-12--json-no-lint-deepseek-coder-whole-2 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.9 + percent_cases_well_formed: 100.0 + error_outputs: 2 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 1 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 25.8 + total_cost: 0.0439 +- dirname: 2024-08-15-14-09-22--json-no-lint-deepseek-coder-func-2 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 53.4 + percent_cases_well_formed: 100.0 + error_outputs: 5 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 6 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 18.8 + total_cost: 0.0333 +- dirname: 
2024-08-15-14-11-45--json-no-lint-gpt-4o-2024-08-06-whole-3 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.9 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 4.3 + total_cost: 0.7945 +- dirname: 2024-08-15-14-13-11--json-no-lint-gpt-4o-2024-08-06-func-3 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 56.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 1 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 5.6 + total_cost: 0.8220 +- dirname: 2024-08-15-14-14-40--json-no-lint-gpt-4o-2024-05-13-whole-3 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 61.7 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 6 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 8.8 + total_cost: 1.4993 +- dirname: 2024-08-15-14-16-34--json-no-lint-gpt-4o-2024-05-13-func-3 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: func + commit_hash: bac04a2 pass_rate_1: 58.6 percent_cases_well_formed: 100.0 error_outputs: 0 @@ -80,75 +413,428 @@ indentation_errors: 0 exhausted_context_windows: 0 
test_timeouts: 0 - command: aider --model claude-3.5-sonnet - date: 2024-08-14 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 versions: 0.50.2-dev - seconds_per_case: 19.7 - total_cost: 2.5335 - -- dirname: 2024-08-14-20-19-23--json-sonnet-non-strict-func + seconds_per_case: 8.7 + total_cost: 1.2064 +- dirname: 2024-08-15-14-17-51--json-no-lint-claude-3.5-sonnet-whole-3 test_cases: 133 - model: claude-3.5-sonnet - edit_format: Tool call - commit_hash: e2f14a2 - pass_rate_1: 52.6 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.2 percent_cases_well_formed: 100.0 - error_outputs: 1 + error_outputs: 0 num_malformed_responses: 0 num_with_malformed_responses: 0 - user_asks: 1 + user_asks: 0 lazy_comments: 0 - syntax_errors: 1 + syntax_errors: 0 indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model claude-3.5-sonnet - date: 2024-08-14 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 versions: 0.50.2-dev - seconds_per_case: 18.9 - total_cost: 2.6341 - -- dirname: 2024-08-14-21-23-27--json-deepseek-whole + seconds_per_case: 11.0 + total_cost: 1.6555 +- dirname: 2024-08-15-14-19-19--json-no-lint-claude-3.5-sonnet-func-3 test_cases: 133 - model: deepseek-coder - edit_format: Markdown - commit_hash: e2f14a2 - pass_rate_1: 61.7 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 51.1 percent_cases_well_formed: 100.0 - error_outputs: 1 + error_outputs: 3 num_malformed_responses: 0 num_with_malformed_responses: 0 - user_asks: 1 + user_asks: 0 lazy_comments: 0 syntax_errors: 0 indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model deepseek-coder - date: 2024-08-14 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 versions: 0.50.2-dev - seconds_per_case: 23.0 - total_cost: 0.0439 - -- dirname: 
2024-08-14-21-20-46--json-deepseek-non-strict-func + seconds_per_case: 10.3 + total_cost: 1.5614 +- dirname: 2024-08-15-14-21-06--json-no-lint-deepseek-coder-whole-3 test_cases: 133 - model: deepseek-coder - edit_format: Tool call - commit_hash: e2f14a2 - pass_rate_1: 54.1 + model: openrouter/deepseek/deepseek-coder + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 61.7 percent_cases_well_formed: 100.0 - error_outputs: 9 + error_outputs: 3 num_malformed_responses: 0 num_with_malformed_responses: 0 - user_asks: 5 + user_asks: 2 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 3 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 24.4 + total_cost: 0.0439 +- dirname: 2024-08-15-14-24-46--json-no-lint-deepseek-coder-func-3 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 52.6 + percent_cases_well_formed: 100.0 + error_outputs: 3 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 12 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 19.0 + total_cost: 0.0334 +- dirname: 2024-08-15-14-27-17--json-no-lint-gpt-4o-2024-08-06-whole-4 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 4.3 + total_cost: 0.8015 +- dirname: 
2024-08-15-14-28-58--json-no-lint-gpt-4o-2024-08-06-func-4 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.0 + total_cost: 0.8394 +- dirname: 2024-08-15-14-30-48--json-no-lint-gpt-4o-2024-05-13-whole-4 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 61.7 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 6 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 12.3 + total_cost: 1.4919 +- dirname: 2024-08-15-14-32-58--json-no-lint-gpt-4o-2024-05-13-func-4 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 lazy_comments: 0 syntax_errors: 2 indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model deepseek-coder - date: 2024-08-14 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 versions: 0.50.2-dev - seconds_per_case: 17.4 - total_cost: 0.0332 - + seconds_per_case: 11.1 + total_cost: 1.2120 +- dirname: 2024-08-15-14-34-39--json-no-lint-claude-3.5-sonnet-whole-4 + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.9 + 
percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 11.3 + total_cost: 1.6635 +- dirname: 2024-08-15-14-36-18--json-no-lint-claude-3.5-sonnet-func-4 + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 55.6 + percent_cases_well_formed: 100.0 + error_outputs: 1 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 10.5 + total_cost: 1.5768 +- dirname: 2024-08-15-14-38-35--json-no-lint-deepseek-coder-whole-4 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 2 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 2 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 24.5 + total_cost: 0.0438 +- dirname: 2024-08-15-14-41-36--json-no-lint-deepseek-coder-func-4 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 49.6 + percent_cases_well_formed: 100.0 + error_outputs: 7 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 2 + indentation_errors: 0 + exhausted_context_windows: 0 + 
test_timeouts: 1 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 18.7 + total_cost: 0.0333 +- dirname: 2024-08-15-14-44-11--json-no-lint-gpt-4o-2024-08-06-whole-5 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.9 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 4.6 + total_cost: 0.8023 +- dirname: 2024-08-15-14-45-40--json-no-lint-gpt-4o-2024-08-06-func-5 + test_cases: 133 + model: openai/gpt-4o-2024-08-06 + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 57.1 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 3 + command: aider --model openai/gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.3 + total_cost: 0.8354 +- dirname: 2024-08-15-14-47-39--json-no-lint-gpt-4o-2024-05-13-whole-5 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 9 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 10.7 + total_cost: 1.4982 +- dirname: 2024-08-15-14-49-44--json-no-lint-gpt-4o-2024-05-13-func-5 + test_cases: 133 + model: openai/gpt-4o-2024-05-13 + 
edit_format: func + commit_hash: bac04a2 + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 4 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openai/gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 10.5 + total_cost: 1.2099 +- dirname: 2024-08-15-14-51-18--json-no-lint-claude-3.5-sonnet-whole-5 + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 11.4 + total_cost: 1.6685 +- dirname: 2024-08-15-14-52-48--json-no-lint-claude-3.5-sonnet-func-5 + test_cases: 133 + model: openrouter/anthropic/claude-3.5-sonnet + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 53.4 + percent_cases_well_formed: 100.0 + error_outputs: 2 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model openrouter/anthropic/claude-3.5-sonnet + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 10.8 + total_cost: 1.5786 +- dirname: 2024-08-15-14-54-41--json-no-lint-deepseek-coder-whole-5 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: whole + commit_hash: bac04a2 + pass_rate_1: 61.7 + percent_cases_well_formed: 100.0 + error_outputs: 2 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 2 + lazy_comments: 0 + syntax_errors: 
0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 24.5 + total_cost: 0.0439 +- dirname: 2024-08-15-14-57-51--json-no-lint-deepseek-coder-func-5 + test_cases: 133 + model: openrouter/deepseek/deepseek-coder + edit_format: func + commit_hash: bac04a2 + pass_rate_1: 53.4 + percent_cases_well_formed: 100.0 + error_outputs: 5 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 4 + indentation_errors: 1 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model openrouter/deepseek/deepseek-coder + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 18.5 + total_cost: 0.0330 From 957374a6114305d94b5956b26437710718e59ef5 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 08:29:43 -0700 Subject: [PATCH 17/34] fix: Update code-in-json post with improved formatting and performance details --- aider/website/_posts/2024-08-14-code-in-json.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index bde150140..22f2dc852 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -36,9 +36,9 @@ document.addEventListener('DOMContentLoaded', function () { const format = context.dataset.label; if (format === 'Markdown') { return 'rgba(54, 162, 235, 0.8)'; - } else if (format.startsWith('Tool call')) { + } else if (format.startsWith('JSON')) { const ctx = context.chart.ctx; - const gradient = ctx.createPattern(createStripedCanvas(format === 'Tool call (strict)'), 'repeat'); + const gradient = ctx.createPattern(createStripedCanvas(format === 'JSON (strict)'), 'repeat'); return gradient; } else { return 'rgba(75, 192, 192, 0.8)'; @@ -124,8 +124,9 @@ receive code 
from LLMs. Unfortunately, LLMs write worse code when asked to wrap it in json, harming their ability to correctly solve coding tasks. -Returning code as plain (markdown) text results in an average of 6% higher scores -on a variant of the aider code editing benchmark. +Returning code as plain (markdown) text results in lower scores +on a variant of the aider code editing benchmark, often significantly harming coding +performance. This holds true across many top coding LLMs, and even OpenAI's newest gpt-4o-2024-08-06 with "strict" json support suffers from this code-in-json handicap. From ea38f91c702d5eba27af87b7aedb4d1a29204f65 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Thu, 15 Aug 2024 08:29:44 -0700 Subject: [PATCH 18/34] feat: Sort x-axis by model name --- aider/website/_posts/2024-08-14-code-in-json.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 22f2dc852..23b58aa33 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -21,7 +21,7 @@ document.addEventListener('DOMContentLoaded', function () { var yamlData = {{ site.data.code-in-json | jsonify }}; - var models = [...new Set(yamlData.map(item => item.model))]; + var models = [...new Set(yamlData.map(item => item.model))].sort(); var editFormats = [...new Set(yamlData.map(item => item.edit_format))]; var datasets = editFormats.map(format => ({ From 04e816ff2e2de14359e69a7e903357e8697523e8 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 09:49:51 -0700 Subject: [PATCH 19/34] copy --- aider/website/_data/code-in-json.yml | 324 +++++++++++------- .../website/_posts/2024-08-14-code-in-json.md | 118 ++++--- 2 files changed, 282 insertions(+), 160 deletions(-) diff --git a/aider/website/_data/code-in-json.yml b/aider/website/_data/code-in-json.yml index 0f2bbcbed..78efd129f 100644 --- 
a/aider/website/_data/code-in-json.yml +++ b/aider/website/_data/code-in-json.yml @@ -1,7 +1,7 @@ - dirname: 2024-08-15-13-17-11--json-no-lint-gpt-4o-2024-08-06-whole test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: whole + model: gpt-4o-2024-08-06 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -14,15 +14,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 4.3 total_cost: 0.7965 - dirname: 2024-08-15-13-18-36--json-no-lint-gpt-4o-2024-08-06-func test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: func + model: gpt-4o-2024-08-06 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 57.9 percent_cases_well_formed: 100.0 @@ -35,15 +35,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 5.7 total_cost: 0.8417 - dirname: 2024-08-15-13-20-11--json-no-lint-gpt-4o-2024-05-13-whole test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: whole + model: gpt-4o-2024-05-13 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 56.4 percent_cases_well_formed: 100.0 @@ -56,15 +56,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 8.0 total_cost: 1.5034 - dirname: 2024-08-15-13-21-55--json-no-lint-gpt-4o-2024-05-13-func test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: func + model: gpt-4o-2024-05-13 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -77,15 +77,15 @@ indentation_errors: 0 exhausted_context_windows: 0 
test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 7.1 total_cost: 1.2285 - dirname: 2024-08-15-13-23-33--json-no-lint-claude-3.5-sonnet-whole test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: whole + model: claude-3.5-sonnet + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -98,15 +98,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 10.5 total_cost: 1.6714 - dirname: 2024-08-15-13-24-56--json-no-lint-claude-3.5-sonnet-func test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: func + model: claude-3.5-sonnet + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 53.4 percent_cases_well_formed: 100.0 @@ -119,15 +119,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 9.7 total_cost: 1.5980 - dirname: 2024-08-15-13-26-38--json-no-lint-deepseek-coder-whole test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: whole + model: deepseek-coder V2 0724 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 59.4 percent_cases_well_formed: 100.0 @@ -140,15 +140,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 27.9 total_cost: 0.0438 - dirname: 2024-08-15-13-29-55--json-no-lint-deepseek-coder-func test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: func + model: 
deepseek-coder V2 0724 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 49.6 percent_cases_well_formed: 100.0 @@ -161,15 +161,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 20.5 total_cost: 0.0329 - dirname: 2024-08-15-13-50-03--json-no-lint-gpt-4o-2024-08-06-whole-2 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: whole + model: gpt-4o-2024-08-06 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 61.7 percent_cases_well_formed: 100.0 @@ -182,15 +182,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 4.2 total_cost: 0.7946 - dirname: 2024-08-15-13-51-36--json-no-lint-gpt-4o-2024-08-06-func-2 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: func + model: gpt-4o-2024-08-06 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 56.4 percent_cases_well_formed: 100.0 @@ -203,15 +203,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 6.4 total_cost: 0.8390 - dirname: 2024-08-15-13-53-23--json-no-lint-gpt-4o-2024-05-13-whole-2 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: whole + model: gpt-4o-2024-05-13 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 59.4 percent_cases_well_formed: 100.0 @@ -224,15 +224,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 7.4 total_cost: 1.4996 - dirname: 
2024-08-15-13-54-53--json-no-lint-gpt-4o-2024-05-13-func-2 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: func + model: gpt-4o-2024-05-13 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -245,15 +245,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 7.7 total_cost: 1.2210 - dirname: 2024-08-15-13-56-21--json-no-lint-claude-3.5-sonnet-whole-2 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: whole + model: claude-3.5-sonnet + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.9 percent_cases_well_formed: 100.0 @@ -266,15 +266,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 16.5 total_cost: 1.6556 - dirname: 2024-08-15-14-02-15--json-no-lint-claude-3.5-sonnet-func-2 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: func + model: claude-3.5-sonnet + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 51.9 percent_cases_well_formed: 100.0 @@ -287,15 +287,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 14.3 total_cost: 1.5835 - dirname: 2024-08-15-14-06-12--json-no-lint-deepseek-coder-whole-2 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: whole + model: deepseek-coder V2 0724 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.9 percent_cases_well_formed: 100.0 @@ -308,15 +308,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider 
--model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 25.8 total_cost: 0.0439 - dirname: 2024-08-15-14-09-22--json-no-lint-deepseek-coder-func-2 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: func + model: deepseek-coder V2 0724 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 53.4 percent_cases_well_formed: 100.0 @@ -329,15 +329,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 18.8 total_cost: 0.0333 - dirname: 2024-08-15-14-11-45--json-no-lint-gpt-4o-2024-08-06-whole-3 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: whole + model: gpt-4o-2024-08-06 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.9 percent_cases_well_formed: 100.0 @@ -350,15 +350,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 4.3 total_cost: 0.7945 - dirname: 2024-08-15-14-13-11--json-no-lint-gpt-4o-2024-08-06-func-3 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: func + model: gpt-4o-2024-08-06 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 56.4 percent_cases_well_formed: 100.0 @@ -371,15 +371,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 5.6 total_cost: 0.8220 - dirname: 2024-08-15-14-14-40--json-no-lint-gpt-4o-2024-05-13-whole-3 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: whole + model: gpt-4o-2024-05-13 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 61.7 
percent_cases_well_formed: 100.0 @@ -392,15 +392,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 8.8 total_cost: 1.4993 - dirname: 2024-08-15-14-16-34--json-no-lint-gpt-4o-2024-05-13-func-3 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: func + model: gpt-4o-2024-05-13 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 58.6 percent_cases_well_formed: 100.0 @@ -413,15 +413,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 8.7 total_cost: 1.2064 - dirname: 2024-08-15-14-17-51--json-no-lint-claude-3.5-sonnet-whole-3 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: whole + model: claude-3.5-sonnet + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -434,15 +434,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 11.0 total_cost: 1.6555 - dirname: 2024-08-15-14-19-19--json-no-lint-claude-3.5-sonnet-func-3 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: func + model: claude-3.5-sonnet + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 51.1 percent_cases_well_formed: 100.0 @@ -455,15 +455,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 10.3 total_cost: 1.5614 - dirname: 2024-08-15-14-21-06--json-no-lint-deepseek-coder-whole-3 
test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: whole + model: deepseek-coder V2 0724 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 61.7 percent_cases_well_formed: 100.0 @@ -476,15 +476,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 3 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 24.4 total_cost: 0.0439 - dirname: 2024-08-15-14-24-46--json-no-lint-deepseek-coder-func-3 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: func + model: deepseek-coder V2 0724 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 52.6 percent_cases_well_formed: 100.0 @@ -497,15 +497,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 19.0 total_cost: 0.0334 - dirname: 2024-08-15-14-27-17--json-no-lint-gpt-4o-2024-08-06-whole-4 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: whole + model: gpt-4o-2024-08-06 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -518,15 +518,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 4.3 total_cost: 0.8015 - dirname: 2024-08-15-14-28-58--json-no-lint-gpt-4o-2024-08-06-func-4 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: func + model: gpt-4o-2024-08-06 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -539,15 +539,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 
2024-08-15 versions: 0.50.2-dev seconds_per_case: 6.0 total_cost: 0.8394 - dirname: 2024-08-15-14-30-48--json-no-lint-gpt-4o-2024-05-13-whole-4 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: whole + model: gpt-4o-2024-05-13 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 61.7 percent_cases_well_formed: 100.0 @@ -560,15 +560,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 12.3 total_cost: 1.4919 - dirname: 2024-08-15-14-32-58--json-no-lint-gpt-4o-2024-05-13-func-4 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: func + model: gpt-4o-2024-05-13 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 59.4 percent_cases_well_formed: 100.0 @@ -581,15 +581,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 11.1 total_cost: 1.2120 - dirname: 2024-08-15-14-34-39--json-no-lint-claude-3.5-sonnet-whole-4 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: whole + model: claude-3.5-sonnet + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.9 percent_cases_well_formed: 100.0 @@ -602,15 +602,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 11.3 total_cost: 1.6635 - dirname: 2024-08-15-14-36-18--json-no-lint-claude-3.5-sonnet-func-4 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: func + model: claude-3.5-sonnet + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 55.6 percent_cases_well_formed: 100.0 @@ -623,15 +623,15 @@ indentation_errors: 0 
exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 10.5 total_cost: 1.5768 - dirname: 2024-08-15-14-38-35--json-no-lint-deepseek-coder-whole-4 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: whole + model: deepseek-coder V2 0724 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 59.4 percent_cases_well_formed: 100.0 @@ -644,15 +644,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 24.5 total_cost: 0.0438 - dirname: 2024-08-15-14-41-36--json-no-lint-deepseek-coder-func-4 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: func + model: deepseek-coder V2 0724 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 49.6 percent_cases_well_formed: 100.0 @@ -665,15 +665,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 18.7 total_cost: 0.0333 - dirname: 2024-08-15-14-44-11--json-no-lint-gpt-4o-2024-08-06-whole-5 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: whole + model: gpt-4o-2024-08-06 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.9 percent_cases_well_formed: 100.0 @@ -686,15 +686,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 4.6 total_cost: 0.8023 - dirname: 2024-08-15-14-45-40--json-no-lint-gpt-4o-2024-08-06-func-5 test_cases: 133 - model: openai/gpt-4o-2024-08-06 - edit_format: func + model: 
gpt-4o-2024-08-06 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 57.1 percent_cases_well_formed: 100.0 @@ -707,15 +707,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 3 - command: aider --model openai/gpt-4o-2024-08-06 + command: aider --model gpt-4o-2024-08-06 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 6.3 total_cost: 0.8354 - dirname: 2024-08-15-14-47-39--json-no-lint-gpt-4o-2024-05-13-whole-5 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: whole + model: gpt-4o-2024-05-13 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -728,15 +728,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 10.7 total_cost: 1.4982 - dirname: 2024-08-15-14-49-44--json-no-lint-gpt-4o-2024-05-13-func-5 test_cases: 133 - model: openai/gpt-4o-2024-05-13 - edit_format: func + model: gpt-4o-2024-05-13 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 59.4 percent_cases_well_formed: 100.0 @@ -749,15 +749,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openai/gpt-4o-2024-05-13 + command: aider --model gpt-4o-2024-05-13 date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 10.5 total_cost: 1.2099 - dirname: 2024-08-15-14-51-18--json-no-lint-claude-3.5-sonnet-whole-5 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: whole + model: claude-3.5-sonnet + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 60.2 percent_cases_well_formed: 100.0 @@ -770,15 +770,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 11.4 total_cost: 1.6685 - dirname: 
2024-08-15-14-52-48--json-no-lint-claude-3.5-sonnet-func-5 test_cases: 133 - model: openrouter/anthropic/claude-3.5-sonnet - edit_format: func + model: claude-3.5-sonnet + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 53.4 percent_cases_well_formed: 100.0 @@ -791,15 +791,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 1 - command: aider --model openrouter/anthropic/claude-3.5-sonnet + command: aider --model claude-3.5-sonnet date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 10.8 total_cost: 1.5786 - dirname: 2024-08-15-14-54-41--json-no-lint-deepseek-coder-whole-5 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: whole + model: deepseek-coder V2 0724 + edit_format: Markdown commit_hash: bac04a2 pass_rate_1: 61.7 percent_cases_well_formed: 100.0 @@ -812,15 +812,15 @@ indentation_errors: 0 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 24.5 total_cost: 0.0439 - dirname: 2024-08-15-14-57-51--json-no-lint-deepseek-coder-func-5 test_cases: 133 - model: openrouter/deepseek/deepseek-coder - edit_format: func + model: deepseek-coder V2 0724 + edit_format: JSON commit_hash: bac04a2 pass_rate_1: 53.4 percent_cases_well_formed: 100.0 @@ -833,8 +833,92 @@ indentation_errors: 1 exhausted_context_windows: 0 test_timeouts: 0 - command: aider --model openrouter/deepseek/deepseek-coder + command: aider --model deepseek-coder date: 2024-08-15 versions: 0.50.2-dev seconds_per_case: 18.5 total_cost: 0.0330 +- dirname: 2024-08-15-15-12-55--json-no-lint-strict-gpt-4o-2024-08-06-func-2 + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: JSON (strict) + commit_hash: bf2d5fe + pass_rate_1: 57.1 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + 
indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 5.9 + total_cost: 0.8216 +- dirname: 2024-08-15-15-14-31--json-no-lint-strict-gpt-4o-2024-08-06-func-3 + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: JSON (strict) + commit_hash: bf2d5fe + pass_rate_1: 54.1 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 2 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.3 + total_cost: 0.8410 +- dirname: 2024-08-15-15-16-14--json-no-lint-strict-gpt-4o-2024-08-06-func-4 + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: JSON (strict) + commit_hash: bf2d5fe + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 5.9 + total_cost: 0.8203 +- dirname: 2024-08-15-15-17-50--json-no-lint-strict-gpt-4o-2024-08-06-func-5 + test_cases: 133 + model: gpt-4o-2024-08-06 + edit_format: JSON (strict) + commit_hash: bf2d5fe + pass_rate_1: 57.1 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 1 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-08-06 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.1 + total_cost: 0.8415 diff --git a/aider/website/_posts/2024-08-14-code-in-json.md 
b/aider/website/_posts/2024-08-14-code-in-json.md index 23b58aa33..9f3345971 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -1,6 +1,6 @@ --- -title: LLMs are bad at returning code in json -excerpt: LLMs write worse code if you ask them to return the code wrapped in json (via a tool or function call). +title: LLMs are bad at returning code in JSON +excerpt: LLMs write worse code if you ask them to return the code wrapped in JSON (via a tool or function call). highlight_image: /assets/code-in-json.jpg draft: true nav_exclude: true @@ -9,7 +9,7 @@ nav_exclude: true {% endif %} -# LLMs are bad at returning code in json +# LLMs are bad at returning code in JSON @@ -67,7 +67,7 @@ document.addEventListener('DOMContentLoaded', function () { beginAtZero: true, title: { display: true, - text: 'Pass Rate (%)' + text: 'Pass Rate (%, average of 5 runs)' }, max: 70 } @@ -75,7 +75,7 @@ document.addEventListener('DOMContentLoaded', function () { plugins: { title: { display: true, - text: 'Pass rate by model and code return strategy', + text: 'Pass rate by model and code wrapping strategy', font: { size: 16 } @@ -116,20 +116,22 @@ document.addEventListener('DOMContentLoaded', function () { ## Abstract -The newest LLMs have support for returning properly formatted json responses, +The newest LLMs have support for returning properly formatted JSON responses, making it easy for client applications to parse complex responses. This makes it tempting for AI coding applications to use tool function calls or other structured reply formats to receive code from LLMs. Unfortunately, -LLMs write worse code when asked to wrap it in json, harming their ability +LLMs write worse code when asked to wrap it in JSON, harming their ability to correctly solve coding tasks. 
-Returning code as plain (markdown) text results in lower scores -on a variant of the aider code editing benchmark, often significantly harming coding -performance. +On a variant of the aider code editing benchmark, +JSON-wrapping code +often significantly harms coding +performance +compared to returning code as plain (markdown) text. This holds true across many top coding LLMs, -and even OpenAI's newest gpt-4o-2024-08-06 with "strict" json support -suffers from this code-in-json handicap. +and even OpenAI's newest gpt-4o-2024-08-06 with "strict" JSON support +suffers from this code-in-JSON handicap. ## Introduction @@ -152,8 +154,7 @@ def greeting(): ```` People expect that it would be easier and more reliable to use tool calls, -and parse a nicely formatted json -response: +which would return a structured JSON response: ``` { @@ -165,32 +166,33 @@ response: This has become even more tempting as LLM providers continue to improve their tooling for reliably generating -valid json. +valid JSON. For example, OpenAI recently announced the ability to -[strictly enforce that json responses will be syntactically correct +[strictly enforce that JSON responses will be syntactically correct and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/). -But producing valid (schema compliant) json is not sufficient for this use case. -The json also has to contain valid, high quality code. -And unfortunately, +But producing valid (schema compliant) JSON is not sufficient for this use case. +The JSON also has to contain valid, high quality code. +Unfortunately, LLMs write worse code when they're asked to -wrap it in json. +wrap it in JSON. In some sense this shouldn't be surprising. Just look at the very simple -json example above, with the escaped +JSON example above, with the escaped quotes `\"` and newlines `\n` mixed into the code. 
-Imagine if the code itself contained json or other quoted strings, +Imagine the additional +complexity +if the code itself contained JSON or other quoted strings, with their own escape sequences. -If you tried to write a program, -would you do a better job +Would *you* write better code by typing it out normally or as a properly escaped -json string? +JSON string? ## Quantifying the benefits of plain text @@ -198,31 +200,33 @@ json string? Previous [aider benchmark results](/2023/07/02/benchmarks.html) showed the superiority of returning code -as plain text coding compared to json-wrapped function calls. +as plain text compared to JSON-wrapped function calls. Those results were obtained over a year ago, against far less capable models. -OpenAI's newly announced support for "strict" json seemed like a good reason to -investigate whether the newest models are still handicapped by json-wrapping code. +OpenAI's newly announced support for "strict" JSON seemed like a good reason to +investigate whether the newest models are still handicapped by JSON-wrapping code. The graph above shows benchmark results from -3 of the strongest code editing models: +4 of the strongest code editing models: -- gpt-4o-2024-08-06 - claude-3-5-sonnet-20240620 - deepseek-coder (V2 0724) +- gpt-4o-2024-05-13 +- gpt-4o-2024-08-06 Each model was given one try to solve [133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark). This is the standard aider "code editing" benchmark, but restricted to a single attempt without a second try to "fix" any errors. -Each model was assessed by the benchmark using two -different strategies for returning code: +The benchmark assessed the models' coding ability +using different strategies for returning code: - **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences. -- **Tool call** -- the model used a tool function call to return the whole source code file.
This requires the LLM to wrap the code in json. +- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON. +- **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting. The markdown strategy is the same as aider's "whole" edit format, where the @@ -238,10 +242,10 @@ def greeting(): ``` ```` -The tool strategy requires the LLM to call the `write_file` function with +The JSON and JSON (strict) strategies required the LLM to call the `write_file` function with two parameters, as shown below. -For maximum simplicity, the LLM didn't even have to specify the filename, -since the benchmark operates only on a single source file. +For maximum simplicity, the LLM didn't have to specify the filename, +since the benchmark operates on one source file at a time. ``` { @@ -250,7 +254,7 @@ since the benchmark operates only on a single source file. } ``` -Both of these formats avoid actually *editing* source files, to keep +These strategies avoid actually *editing* source files, to keep the task as simple as possible. The LLM is able to emit the whole source file intact, @@ -260,9 +264,43 @@ instructions to edit portions of a file. This experimental setup is designed to highlight -the effects of json-wrapping on the LLMs ability to write code to solve a task. +the effects of JSON-wrapping on the LLMs' ability to write code to solve a task. +The results in the graph are the average of 5 runs for each +model & strategy combination. ## Results -All 3 models did significantly worse on the benchmark when asked to -return json-wrapped code in a tool function call. +All of the models did worse on the benchmark when asked to +return JSON-wrapped code in a tool function call. +Most did significantly worse, performing far below +the result obtained with the markdown strategy.
+ +Some noteworthy observations: + +- OpenAI's gpt-4o-2024-05-13 was the only model where the markdown and JSON results were +close. Using JSON only dropped the score by 0.3 percent, a difference which is +probably within the margin of error for 5 trials. +- The use of OpenAI's new strict mode seemed to harm the results for gpt-4o-2024-08-06 +as compared to non-strict JSON. +Of course, both JSON results were well below the markdown result. +- The results from Sonnet and DeepSeek Coder suffered the worst harm from JSON wrapping. + +## Conclusions + +While the quantitative results differ from the similar +[July 2023 experiments](/2023/07/02/benchmarks.html), +the conclusion seems unchanged: LLMs are bad at returning code in JSON. + +OpenAI appears to be making progress in allowing LLMs to return code in +structured JSON responses without harming the code quality. +But it seems premature to consider switching from plain text +to JSON-wrapped code. + + +## Notes on the aider leaderboard + +The results presented here are not directly comparable to results +from the main +[aider LLM leaderboard](https://aider.chat/docs/leaderboards/). +A number of settings were changed to simplify the benchmark +in order to focus on comparing plain text and JSON wrapped code. 
From 5ccdebf2c0a5e094949f0ca1da4be07ae006c6ff Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Thu, 15 Aug 2024 09:50:50 -0700 Subject: [PATCH 20/34] refactor: Extract color assignment logic into a separate function --- benchmark/over_time.py | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/benchmark/over_time.py b/benchmark/over_time.py index 565038a8e..f72bac31e 100644 --- a/benchmark/over_time.py +++ b/benchmark/over_time.py @@ -6,6 +6,17 @@ from matplotlib import rc from aider.dump import dump # noqa: 401 +def get_model_color(model): + if "-4o" in model and "gpt-4o-mini" not in model: + return "purple" + elif "gpt-4" in model: + return "red" + elif "gpt-3.5" in model: + return "green" + else: + return "lightblue" + + def plot_over_time(yaml_file): with open(yaml_file, "r") as file: data = yaml.safe_load(file) @@ -49,14 +60,7 @@ def plot_over_time(yaml_file): spine.set_edgecolor("#DDDDDD") spine.set_linewidth(0.5) - colors = [ - ( - "purple" - if "-4o" in model and "gpt-4o-mini" not in model - else "red" if "gpt-4" in model else "green" if "gpt-3.5" in model else "lightblue" - ) - for model in models - ] + colors = [get_model_color(model) for model in models] # Separate data points by color purple_points = [(d, r) for d, r, c in zip(dates, pass_rates, colors) if c == "purple"] From 822a8ab671f49a25a25259802b178bc02534d4a7 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 09:52:21 -0700 Subject: [PATCH 21/34] remove gpt-4o-mini from the gpt-4 trendline --- aider/website/assets/models-over-time.svg | 169 +++++++++++----------- benchmark/over_time.py | 17 ++- 2 files changed, 96 insertions(+), 90 deletions(-) diff --git a/aider/website/assets/models-over-time.svg b/aider/website/assets/models-over-time.svg index a4fe87061..8fd066630 100644 --- a/aider/website/assets/models-over-time.svg +++ b/aider/website/assets/models-over-time.svg @@ -6,7 +6,7 @@ - 2024-08-14T06:29:51.758534 + 
2024-08-15T09:51:56.911643 image/svg+xml [remaining models-over-time.svg hunks omitted: extraction-damaged SVG markup for the grid lines and the purple, red and green trendline paths, clip-path `p463a66dc35`] diff --git a/benchmark/over_time.py b/benchmark/over_time.py index f72bac31e..931b7e1d2 100644 --- a/benchmark/over_time.py +++ b/benchmark/over_time.py @@ -7,14 +7,21 @@ from aider.dump import dump # 
noqa: 401 def get_model_color(model): - if "-4o" in model and "gpt-4o-mini" not in model: + default = "lightblue" + + if model == "gpt-4o-mini": + return default + + if "-4o" in model: return "purple" - elif "gpt-4" in model: + + if "gpt-4" in model: return "red" - elif "gpt-3.5" in model: + + if "gpt-3.5" in model: return "green" - else: - return "lightblue" + + return default def plot_over_time(yaml_file): From 19073dd93904ee51b4a10d124d53b2a4432ddcc4 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 10:08:13 -0700 Subject: [PATCH 22/34] feat: Add section on overall coding skill and syntax errors to blog post on code in JSON --- aider/website/_posts/2024-08-14-code-in-json.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 9f3345971..601fab485 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -270,6 +270,9 @@ model & strategy combination. ## Results + +## Overall coding skill + All of the models did worse on the benchmark when asked to return JSON-wrapped code in a tool function call. Most did significantly worse, performing far below @@ -285,6 +288,9 @@ as compared to non-strict JSON. Of course, both JSON results were well below the markdown result. - The results from Sonnet and DeepSeek Coder suffered the worst harm from JSON wrapping. 
+## Syntax errors + + ## Conclusions While the quantitative results differ from the similar From 8f0cc731fdb5daddb5eb0498f9059ad92d9a2984 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Thu, 15 Aug 2024 10:10:01 -0700 Subject: [PATCH 23/34] feat: Increase chart height on small screens --- .../website/_posts/2024-08-14-code-in-json.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 601fab485..b4db42cda 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -12,7 +12,9 @@ nav_exclude: true # LLMs are bad at returning code in JSON - +
+ +
+ ## Conclusions From a2882f4104711de5bc383e7d5fc5c79365572ed2 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Thu, 15 Aug 2024 10:23:49 -0700 Subject: [PATCH 26/34] feat: Add createStripedCanvas function to second chart's script --- .../website/_posts/2024-08-14-code-in-json.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 5db5b05ea..f0e9a2243 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -128,6 +128,28 @@ document.addEventListener('DOMContentLoaded', function () { new Chart(ctx, config); }); + +function createStripedCanvas(isStrict) { + const patternCanvas = document.createElement('canvas'); + const patternContext = patternCanvas.getContext('2d'); + const size = 10; + patternCanvas.width = size; + patternCanvas.height = size; + + patternContext.fillStyle = 'rgba(255, 99, 132, 0.8)'; + patternContext.fillRect(0, 0, size, size); + + if (isStrict) { + patternContext.strokeStyle = 'rgba(255, 255, 255, 0.8)'; + patternContext.lineWidth = 0.75; + patternContext.beginPath(); + patternContext.moveTo(0, 0); + patternContext.lineTo(size, size); + patternContext.stroke(); + } + + return patternCanvas; +} From 2bb75dc11ffada9b7e0f468132e8c6f2051346b0 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 10:33:22 -0700 Subject: [PATCH 27/34] feat: Add figures and captions to blog post on code in JSON --- aider/website/_posts/2024-08-14-code-in-json.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index f0e9a2243..69553ad52 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -152,6 +152,8 @@ function createStripedCanvas(isStrict) { } +> Figure 1: Benchmark scores of models using 
either plain markdown text or JSON to return code. +> Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call. ## Abstract @@ -421,6 +423,9 @@ document.addEventListener('DOMContentLoaded', function () { }); +> Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code. +> Models tend to make more syntactic errors when asked to wrap code in JSON. + ## Conclusions From 8d4d549a9834bdc6833b49a76140edef7d71a75d Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 10:34:49 -0700 Subject: [PATCH 28/34] catch litellm bug for image size --- aider/models.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/aider/models.py b/aider/models.py index cad99d1df..4e5bf74ca 100644 --- a/aider/models.py +++ b/aider/models.py @@ -516,7 +516,11 @@ class Model: def token_count(self, messages): if type(messages) is list: - return litellm.token_counter(model=self.name, messages=messages) + try: + return litellm.token_counter(model=self.name, messages=messages) + except Exception as err: + print(f"Unable to count tokens: {err}") + return 0 if not self.tokenizer: return From e90642295dc5f7d52043c1ffdc8bb54fb504c3c2 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 10:50:51 -0700 Subject: [PATCH 29/34] feat: Add code-in-json-benchmark.js file and update code-in-json.md post --- .../website/_includes/code-in-json-benchmark.js | 0 aider/website/_posts/2024-08-14-code-in-json.md | 16 +++++++--------- 2 files changed, 7 insertions(+), 9 deletions(-) create mode 100644 aider/website/_includes/code-in-json-benchmark.js diff --git a/aider/website/_includes/code-in-json-benchmark.js b/aider/website/_includes/code-in-json-benchmark.js new file mode 100644 index 000000000..e69de29bb diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 69553ad52..fe6a63466 100644 --- 
a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -157,21 +157,19 @@ function createStripedCanvas(isStrict) { ## Abstract -The newest LLMs have support for returning properly formatted JSON responses, -making it easier for clients to parse complex responses. -This makes it tempting for AI coding applications to -use JSON replies to -receive code from LLMs. +Current LLMs have support for returning properly formatted JSON, +making it easier for clients to reliably parse complex responses. +It therefore seems attractive for +AI coding applications ask LLMs to return code in structure JSON replies. Unfortunately, LLMs write worse code when asked to wrap it in JSON, harming their ability to correctly solve coding tasks. On a variant of the aider code editing benchmark, -JSON-wrapping code +asking for JSON-wrapped code often significantly harms coding -performance -compared to returning code as plain text. +performance. This holds true across many top coding LLMs, -including OpenAI's new gpt-4o-2024-08-06 +including OpenAI's latest model gpt-4o-2024-08-06 which has strong JSON support. ## Introduction From f91faf52dc412c95acd485fe84b6027504b92a46 Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 10:57:47 -0700 Subject: [PATCH 30/34] feat: Add code-in-json-syntax.js and update code-in-json-benchmark.js --- .../_includes/code-in-json-benchmark.js | 139 ++++++++++++++++++ .../website/_includes/code-in-json-syntax.js | 93 ++++++++++++ 2 files changed, 232 insertions(+) create mode 100644 aider/website/_includes/code-in-json-syntax.js diff --git a/aider/website/_includes/code-in-json-benchmark.js b/aider/website/_includes/code-in-json-benchmark.js index e69de29bb..93b1ff857 100644 --- a/aider/website/_includes/code-in-json-benchmark.js +++ b/aider/website/_includes/code-in-json-benchmark.js @@ -0,0 +1,139 @@ +
+ +
+ + + diff --git a/aider/website/_includes/code-in-json-syntax.js b/aider/website/_includes/code-in-json-syntax.js new file mode 100644 index 000000000..4008484d1 --- /dev/null +++ b/aider/website/_includes/code-in-json-syntax.js @@ -0,0 +1,93 @@ +
+ +
+ + From 353b63109128c883dfcc1ffdc3267780963df382 Mon Sep 17 00:00:00 2001 From: "Paul Gauthier (aider)" Date: Thu, 15 Aug 2024 10:57:48 -0700 Subject: [PATCH 31/34] feat: Add bar value labels to charts --- .../_includes/code-in-json-benchmark.js | 33 ++++++++++++++++++- .../website/_includes/code-in-json-syntax.js | 33 ++++++++++++++++++- 2 files changed, 64 insertions(+), 2 deletions(-) diff --git a/aider/website/_includes/code-in-json-benchmark.js b/aider/website/_includes/code-in-json-benchmark.js index 93b1ff857..0a8f75e74 100644 --- a/aider/website/_includes/code-in-json-benchmark.js +++ b/aider/website/_includes/code-in-json-benchmark.js @@ -71,9 +71,40 @@ document.addEventListener('DOMContentLoaded', function () { }, legend: { position: 'top', + }, + tooltip: { + callbacks: { + label: function(context) { + let label = context.dataset.label || ''; + if (label) { + label += ': '; + } + if (context.parsed.y !== null) { + label += context.parsed.y.toFixed(1) + '%'; + } + return label; + } + } } } - } + }, + plugins: [{ + afterDraw: function(chart) { + var ctx = chart.ctx; + chart.data.datasets.forEach(function(dataset, i) { + var meta = chart.getDatasetMeta(i); + meta.data.forEach(function(bar, index) { + var data = dataset.data[index]; + if (data !== null) { + ctx.fillStyle = '#000000'; + ctx.textAlign = 'center'; + ctx.textBaseline = 'bottom'; + ctx.fillText(data.toFixed(1) + '%', bar.x, bar.y - 5); + } + }); + }); + } + }] }; // Adjust chart height based on screen width diff --git a/aider/website/_includes/code-in-json-syntax.js b/aider/website/_includes/code-in-json-syntax.js index 4008484d1..77d347cda 100644 --- a/aider/website/_includes/code-in-json-syntax.js +++ b/aider/website/_includes/code-in-json-syntax.js @@ -69,9 +69,40 @@ document.addEventListener('DOMContentLoaded', function () { }, legend: { position: 'top', + }, + tooltip: { + callbacks: { + label: function(context) { + let label = context.dataset.label || ''; + if (label) { + label += ': '; + 
} + if (context.parsed.y !== null) { + label += context.parsed.y; + } + return label; + } + } } } - } + }, + plugins: [{ + afterDraw: function(chart) { + var ctx = chart.ctx; + chart.data.datasets.forEach(function(dataset, i) { + var meta = chart.getDatasetMeta(i); + meta.data.forEach(function(bar, index) { + var data = dataset.data[index]; + if (data !== null) { + ctx.fillStyle = '#000000'; + ctx.textAlign = 'center'; + ctx.textBaseline = 'bottom'; + ctx.fillText(data, bar.x, bar.y - 5); + } + }); + }); + } + }] }; // Adjust chart height based on screen width From 679e1b8990a4059c51e5fc0827bc9f3f7e81da7c Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 11:13:20 -0700 Subject: [PATCH 32/34] copy --- aider/website/_data/code-in-json.yml | 210 +++++++------- .../website/_includes/code-in-json-syntax.js | 3 +- .../website/_posts/2024-08-14-code-in-json.md | 274 ++---------------- 3 files changed, 136 insertions(+), 351 deletions(-) diff --git a/aider/website/_data/code-in-json.yml b/aider/website/_data/code-in-json.yml index 78efd129f..d983aefa8 100644 --- a/aider/website/_data/code-in-json.yml +++ b/aider/website/_data/code-in-json.yml @@ -40,27 +40,6 @@ versions: 0.50.2-dev seconds_per_case: 5.7 total_cost: 0.8417 -- dirname: 2024-08-15-13-20-11--json-no-lint-gpt-4o-2024-05-13-whole - test_cases: 133 - model: gpt-4o-2024-05-13 - edit_format: Markdown - commit_hash: bac04a2 - pass_rate_1: 56.4 - percent_cases_well_formed: 100.0 - error_outputs: 0 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 0 - lazy_comments: 0 - syntax_errors: 0 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 1 - command: aider --model gpt-4o-2024-05-13 - date: 2024-08-15 - versions: 0.50.2-dev - seconds_per_case: 8.0 - total_cost: 1.5034 - dirname: 2024-08-15-13-21-55--json-no-lint-gpt-4o-2024-05-13-func test_cases: 133 model: gpt-4o-2024-05-13 @@ -208,27 +187,6 @@ versions: 0.50.2-dev seconds_per_case: 6.4 total_cost: 0.8390 
-- dirname: 2024-08-15-13-53-23--json-no-lint-gpt-4o-2024-05-13-whole-2 - test_cases: 133 - model: gpt-4o-2024-05-13 - edit_format: Markdown - commit_hash: bac04a2 - pass_rate_1: 59.4 - percent_cases_well_formed: 100.0 - error_outputs: 0 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 0 - lazy_comments: 0 - syntax_errors: 0 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 0 - command: aider --model gpt-4o-2024-05-13 - date: 2024-08-15 - versions: 0.50.2-dev - seconds_per_case: 7.4 - total_cost: 1.4996 - dirname: 2024-08-15-13-54-53--json-no-lint-gpt-4o-2024-05-13-func-2 test_cases: 133 model: gpt-4o-2024-05-13 @@ -376,27 +334,6 @@ versions: 0.50.2-dev seconds_per_case: 5.6 total_cost: 0.8220 -- dirname: 2024-08-15-14-14-40--json-no-lint-gpt-4o-2024-05-13-whole-3 - test_cases: 133 - model: gpt-4o-2024-05-13 - edit_format: Markdown - commit_hash: bac04a2 - pass_rate_1: 61.7 - percent_cases_well_formed: 100.0 - error_outputs: 0 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 0 - lazy_comments: 0 - syntax_errors: 6 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 1 - command: aider --model gpt-4o-2024-05-13 - date: 2024-08-15 - versions: 0.50.2-dev - seconds_per_case: 8.8 - total_cost: 1.4993 - dirname: 2024-08-15-14-16-34--json-no-lint-gpt-4o-2024-05-13-func-3 test_cases: 133 model: gpt-4o-2024-05-13 @@ -544,27 +481,6 @@ versions: 0.50.2-dev seconds_per_case: 6.0 total_cost: 0.8394 -- dirname: 2024-08-15-14-30-48--json-no-lint-gpt-4o-2024-05-13-whole-4 - test_cases: 133 - model: gpt-4o-2024-05-13 - edit_format: Markdown - commit_hash: bac04a2 - pass_rate_1: 61.7 - percent_cases_well_formed: 100.0 - error_outputs: 0 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 0 - lazy_comments: 0 - syntax_errors: 6 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 0 - command: aider --model gpt-4o-2024-05-13 - date: 2024-08-15 - 
versions: 0.50.2-dev - seconds_per_case: 12.3 - total_cost: 1.4919 - dirname: 2024-08-15-14-32-58--json-no-lint-gpt-4o-2024-05-13-func-4 test_cases: 133 model: gpt-4o-2024-05-13 @@ -712,27 +628,6 @@ versions: 0.50.2-dev seconds_per_case: 6.3 total_cost: 0.8354 -- dirname: 2024-08-15-14-47-39--json-no-lint-gpt-4o-2024-05-13-whole-5 - test_cases: 133 - model: gpt-4o-2024-05-13 - edit_format: Markdown - commit_hash: bac04a2 - pass_rate_1: 60.2 - percent_cases_well_formed: 100.0 - error_outputs: 0 - num_malformed_responses: 0 - num_with_malformed_responses: 0 - user_asks: 0 - lazy_comments: 0 - syntax_errors: 9 - indentation_errors: 0 - exhausted_context_windows: 0 - test_timeouts: 1 - command: aider --model gpt-4o-2024-05-13 - date: 2024-08-15 - versions: 0.50.2-dev - seconds_per_case: 10.7 - total_cost: 1.4982 - dirname: 2024-08-15-14-49-44--json-no-lint-gpt-4o-2024-05-13-func-5 test_cases: 133 model: gpt-4o-2024-05-13 @@ -922,3 +817,108 @@ versions: 0.50.2-dev seconds_per_case: 6.1 total_cost: 0.8415 +- dirname: 2024-08-15-17-36-22--json-no-lint-again-gpt-4o-2024-05-13-whole-1 + test_cases: 133 + model: gpt-4o-2024-05-13 + edit_format: Markdown + commit_hash: ed94379 + pass_rate_1: 60.2 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 7 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.8 + total_cost: 1.5110 +- dirname: 2024-08-15-17-38-13--json-no-lint-again-gpt-4o-2024-05-13-whole-2 + test_cases: 133 + model: gpt-4o-2024-05-13 + edit_format: Markdown + commit_hash: ed94379 + pass_rate_1: 60.9 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + 
test_timeouts: 1 + command: aider --model gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 7.0 + total_cost: 1.4954 +- dirname: 2024-08-15-17-40-10--json-no-lint-again-gpt-4o-2024-05-13-whole-3 + test_cases: 133 + model: gpt-4o-2024-05-13 + edit_format: Markdown + commit_hash: ed94379 + pass_rate_1: 60.9 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 0 + command: aider --model gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 6.8 + total_cost: 1.4999 +- dirname: 2024-08-15-17-41-30--json-no-lint-again-gpt-4o-2024-05-13-whole-4 + test_cases: 133 + model: gpt-4o-2024-05-13 + edit_format: Markdown + commit_hash: ed94379 + pass_rate_1: 58.6 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 7.4 + total_cost: 1.4848 +- dirname: 2024-08-15-17-43-12--json-no-lint-again-gpt-4o-2024-05-13-whole-5 + test_cases: 133 + model: gpt-4o-2024-05-13 + edit_format: Markdown + commit_hash: ed94379 + pass_rate_1: 59.4 + percent_cases_well_formed: 100.0 + error_outputs: 0 + num_malformed_responses: 0 + num_with_malformed_responses: 0 + user_asks: 0 + lazy_comments: 0 + syntax_errors: 0 + indentation_errors: 0 + exhausted_context_windows: 0 + test_timeouts: 1 + command: aider --model gpt-4o-2024-05-13 + date: 2024-08-15 + versions: 0.50.2-dev + seconds_per_case: 7.6 + total_cost: 1.4948 diff --git a/aider/website/_includes/code-in-json-syntax.js b/aider/website/_includes/code-in-json-syntax.js index 77d347cda..b315edea9 100644 --- 
a/aider/website/_includes/code-in-json-syntax.js +++ b/aider/website/_includes/code-in-json-syntax.js @@ -56,7 +56,8 @@ document.addEventListener('DOMContentLoaded', function () { title: { display: true, text: 'Total syntactic errors from 5 runs' - } + }, + max: 35 } }, plugins: { diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index fe6a63466..6546e1dfa 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -12,155 +12,12 @@ nav_exclude: true # LLMs are bad at returning code in JSON -
- -
- - - - -> Figure 1: Benchmark scores of models using either plain markdown text or JSON to return code. -> Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call. - ## Abstract Current LLMs have support for returning properly formatted JSON, making it easier for clients to reliably parse complex responses. It therefore seems attractive for -AI coding applications ask LLMs to return code in structure JSON replies. +AI coding applications ask LLMs to return code in structured JSON replies. Unfortunately, LLMs write worse code when asked to wrap it in JSON, harming their ability to correctly solve coding tasks. @@ -172,6 +29,13 @@ This holds true across many top coding LLMs, including OpenAI's latest model gpt-4o-2024-08-06 which has strong JSON support. +{% include code-in-json-benchmark.js %} + +> Figure 1: Benchmark scores of models using either plain markdown text or JSON to return code, +> averaged over 5 runs. +> Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call. + + ## Introduction A lot of people wonder why aider doesn't use LLM tools for code editing. @@ -244,9 +108,8 @@ capable models. OpenAI's newly announced support for "strict" JSON seemed like a good reason to investigate whether the newest models are still handicapped by JSON-wrapping code. -The graph above shows benchmark -results from -4 of the strongest code editing models: +Four of the strongest code editing models were benchmarked +to assess the impact of JSON-wrapping code: - claude-3-5-sonnet-20240620 - deepseek-coder (V2 0724) @@ -302,15 +165,16 @@ portions of a file. This experimental setup is designed to highlight the effects of JSON-wrapping on the LLMs ability to write code to solve a task. -The results in the graph are the average of 5 runs for each -model & strategy combination. 
## Results +Each of the 4 models was benchmarked 5 times using the different +strategies for returning code. ## Overall coding skill -All of the models did worse on the benchmark when asked to +As shown in Figure 1, +all of the models did worse on the benchmark when asked to return JSON-wrapped code in a tool function call. Most did significantly worse, performing far below the result obtained with the markdown strategy. @@ -319,109 +183,29 @@ Some noteworthy observations: - OpenAI's gpt-4o-2024-05-13 was the only model where the markdown and JSON results were close. Using JSON only dropped the score by 0.3 percent, a difference which is -probably within the margin of error for 5 trials. -- The use of OpenAI's new strict mode seemed to harm the results for gpt-4o-2024-08-06 -as compared to non-strict JSON. +within the margin of error for 5 trials. +- The use of OpenAI's new strict mode offered no improvement +as compared to non-strict JSON. Of course, both JSON results were well below the markdown result. - The results from Sonnet and DeepSeek Coder suffered the worst harm from JSON wrapping. ## Syntax errors -
- -
+Figure 2 shows the number of syntactic errors found in the code produced by each +model and code wrapping strategy. +Models tend to make more syntactic errors when asked to wrap code in JSON. - - -> Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code. +> Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code, +> totaled from 5 runs. > Models tend to make more syntactic errors when asked to wrap code in JSON. From 479f73871b7529b1f354bc756a8a76ee8023f2aa Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 12:14:39 -0700 Subject: [PATCH 33/34] more debug on unexepcted error --- aider/coders/base_coder.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/aider/coders/base_coder.py b/aider/coders/base_coder.py index c7f9816e8..8c3484c69 100755 --- a/aider/coders/base_coder.py +++ b/aider/coders/base_coder.py @@ -1009,7 +1009,8 @@ class Coder: ) except Exception as err: self.io.tool_error(f"Unexpected error: {err}") - traceback.print_exc() + lines = traceback.format_exception(type(err), err, err.__traceback__) + self.io.tool_error("".join(lines)) return finally: if self.mdstream: From 3e5dba8d5ca6ebe2f14783136a3d890ee6b0ae5d Mon Sep 17 00:00:00 2001 From: Paul Gauthier Date: Thu, 15 Aug 2024 12:14:49 -0700 Subject: [PATCH 34/34] copy --- .../website/_includes/code-in-json-syntax.js | 4 +- .../website/_posts/2024-08-14-code-in-json.md | 87 ++++++++++--------- 2 files changed, 47 insertions(+), 44 deletions(-) diff --git a/aider/website/_includes/code-in-json-syntax.js b/aider/website/_includes/code-in-json-syntax.js index b315edea9..5c0e652b1 100644 --- a/aider/website/_includes/code-in-json-syntax.js +++ b/aider/website/_includes/code-in-json-syntax.js @@ -55,7 +55,7 @@ document.addEventListener('DOMContentLoaded', function () { beginAtZero: true, title: { display: true, - text: 'Total syntactic errors from 5 runs' + text: 'Total syntax errors from 5 runs' 
}, max: 35 } @@ -63,7 +63,7 @@ document.addEventListener('DOMContentLoaded', function () { plugins: { title: { display: true, - text: 'Syntactic errors by model and code wrapping strategy', + text: 'Syntax errors by model and code wrapping strategy', font: { size: 16 } diff --git a/aider/website/_posts/2024-08-14-code-in-json.md b/aider/website/_posts/2024-08-14-code-in-json.md index 6546e1dfa..59cc444f4 100644 --- a/aider/website/_posts/2024-08-14-code-in-json.md +++ b/aider/website/_posts/2024-08-14-code-in-json.md @@ -12,8 +12,6 @@ nav_exclude: true # LLMs are bad at returning code in JSON -## Abstract - Current LLMs have support for returning properly formatted JSON, making it easier for clients to reliably parse complex responses. It therefore seems attractive for @@ -23,8 +21,7 @@ LLMs write worse code when asked to wrap it in JSON, harming their ability to correctly solve coding tasks. On a variant of the aider code editing benchmark, asking for JSON-wrapped code -often significantly harms coding -performance. +often harms coding performance. This holds true across many top coding LLMs, including OpenAI's latest model gpt-4o-2024-08-06 which has strong JSON support. @@ -36,7 +33,7 @@ which has strong JSON support. > Models produce better code when they return it as plain markdown text, as compared to wrapping it in JSON for a tool function call. -## Introduction +## Background A lot of people wonder why aider doesn't use LLM tools for code editing. Instead, aider asks for code edits in plain text, like this: @@ -66,14 +63,17 @@ which would return a structured JSON response: ``` This has become even more tempting as LLM providers -continue to improve their tooling for reliably generating -valid JSON. -For example, OpenAI recently announced the ability to -[strictly enforce that JSON responses will be syntactically correct -and conform to a specified schema](https://openai.com/index/introducing-structured-outputs-in-the-api/). 
+continue to improve their tooling for reliably generating JSON. +For example, +[OpenAI recently announced](https://openai.com/index/introducing-structured-outputs-in-the-api/) +the ability to +strictly enforce that JSON responses will be syntactically correct +and conform to a specified schema. + But producing valid (schema compliant) JSON is not sufficient for working with AI generated code. -The code inside the JSON has to be valid and high quality too. +The code inside the JSON has to correctly solve the requested task +and be free from syntax errors. Unfortunately, LLMs write worse code when they're asked to wrap it in JSON. @@ -108,29 +108,23 @@ capable models. OpenAI's newly announced support for "strict" JSON seemed like a good reason to investigate whether the newest models are still handicapped by JSON-wrapping code. -Four of the strongest code editing models were benchmarked -to assess the impact of JSON-wrapping code: +The results presented here were based on +the +[aider "code editing" benchmark](/2023/07/02/benchmarks.html#the-benchmark) +of 133 practice exercises from the Exercism python repository. +Models were +restricted to a single attempt, +without a second try to fix errors as is normal in the aider benchmark. -- claude-3-5-sonnet-20240620 -- deepseek-coder (V2 0724) -- gpt-4o-2024-05-13 -- gpt-4o-2024-08-06 - -Each model was given one try to solve -[133 practice exercises from the Exercism python repository](/2023/07/02/benchmarks.html#the-benchmark). -This is the standard aider "code editing" benchmark, but restricted to a single attempt -without a second try to "fix" any errors. - -The benchmark assessed the models coding ability -using different strategies for returning code: +The performance of each model was compared across different strategies for returning code: - **Markdown** -- the model returned the whole source code file in standard markdown triple-backtick fences. 
-- **JSON** -- the model used a tool function call to return the whole source code file. This requires the LLM to wrap the code in JSON. +- **JSON** -- the model used a tool function call to return the whole source code file. This required the LLM to wrap the code in JSON. - **JSON (strict)** -- the same as the "JSON" strategy, but with `strict=True`. Only gpt-4o-2024-08-06 supports this setting. The markdown strategy is the same as aider's "whole" edit format, where the -LLM would return a source file like this: +LLM returns a source file like this: ```` Here is the program you asked for which prints "Hello": @@ -163,13 +157,20 @@ than correctly formulating instructions to edit portions of a file. -This experimental setup is designed to highlight +This experimental setup is designed to quantify the effects of JSON-wrapping on the LLMs ability to write code to solve a task. ## Results -Each of the 4 models was benchmarked 5 times using the different -strategies for returning code. +Four of the strongest code editing models were benchmarked +to assess the impact of JSON-wrapping code: + +- claude-3-5-sonnet-20240620 +- deepseek-coder (V2 0724) +- gpt-4o-2024-05-13 +- gpt-4o-2024-08-06 + +Each combination of model and code wrapping strategy was benchmarked 5 times. ## Overall coding skill @@ -191,22 +192,24 @@ Of course, both JSON results were well below the markdown result. ## Syntax errors -Figure 2 shows the number of syntactic errors found in the code produced by each -model and code wrapping strategy. -Models tend to make more syntactic errors when asked to wrap code in JSON. +Models tend to make more syntax errors when asked to wrap code in JSON. +Figure 2 shows the number of syntax errors found in the code produced by each +model and code wrapping strategy, +totaling up `SyntaxError` and `IndentationError` errors from all 5 runs. 
-Sonnet avoided syntactic errors regardless of the code wrapping strategy, -but its benchmark scores in Figure 1 were lower with JSON. -This seems to indicate that JSON-wrapping -does more than simply raise the syntactic difficulty in coding. -It may distract or challenge the model in a way that -reduces its ability to reason about coding problems. + +Sonnet's results seem to indicate that the negative effects of JSON-wrapping +go beyond syntactic difficulties. +Sonnet avoided syntax errors regardless of the code wrapping strategy, +but its benchmark scores in Figure 1 were nonetheless lower with JSON. +This implies that JSON-wrapping may distract or challenge models in a way that +reduces their ability to reason about solving coding problems. {% include code-in-json-syntax.js %} > Figure 2: Number of `SyntaxError` and `IndentationError` errors found in model generated code, > totaled from 5 runs. -> Models tend to make more syntactic errors when asked to wrap code in JSON. +> Models tend to make more syntax and formatting errors when asked to wrap code in JSON. ## Conclusions @@ -217,7 +220,7 @@ the conclusion seems unchanged: LLMs are bad at returning code in JSON. OpenAI appears to be making progress in allowing LLMs to return code in structured JSON responses without harming the code quality. -But it seems premature to consider switching from plain text +But it still seems premature to consider switching from plain text to JSON-wrapped code. @@ -227,4 +230,4 @@ The results presented here are not directly comparable to results from the main [aider LLM leaderboard](https://aider.chat/docs/leaderboards/). A number of settings were changed to simplify the benchmark -in order to focus on comparing plain text and JSON wrapped code. +in order to focus on comparing plain text and JSON-wrapped code.