Paul Gauthier 2023-07-02 08:06:17 -07:00
parent cef990cd98
commit 93e29eda94

@@ -5,28 +5,20 @@
Aider is an open source command line chat tool that lets you work with GPT to edit
code in your local git repo.
You can use aider to have GPT add features, write tests or make other changes to your code.
To do this, aider needs to be able to reliably recognize when GPT wants to edit local files,
determine which files it wants to modify and what changes to save.
Such automated
code editing hinges on the "edit format" portion of the system prompt, which specifies
how GPT should structure code edits in its responses.
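
To make the idea of an edit format concrete, here is a hedged sketch of the kind of reply the simplest text based format asks for: the complete updated file inside a standard markdown fenced code block. The file name, its contents, and the convention of putting the path above the fence are assumptions made for illustration, not aider's exact prompt or reply.

````
greeting.py
```python
def greet(name):
    # the reply contains the entire updated file, not just the changed lines
    return f"Hello, {name}!"
```
````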

Aider currently uses simple text based editing formats, but
[OpenAI's new function calling
API](https://openai.com/blog/function-calling-and-other-api-updates)
looks like a promising way to create more structured edit formats.
I wanted
a quantitative way to assess the potential benefits
of switching aider to function based editing.
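
As a rough sketch of what "function based editing" could mean with the June 2023 functions API (using the OpenAI Python library of that era), a tool can declare a hypothetical `write_file` function and ask GPT to return its edit as structured JSON arguments instead of free text. The function name and schema below are illustrative assumptions, not aider's actual edit format.

```python
import openai  # 2023-era openai-python interface

# Hypothetical function schema: ask GPT to return an edit as structured
# arguments (a path plus the complete new file contents).
functions = [
    {
        "name": "write_file",
        "description": "Save an updated copy of one source file.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "file to modify"},
                "content": {"type": "string", "description": "complete new file contents"},
            },
            "required": ["path", "content"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Add a greet() function to greeting.py"}],
    functions=functions,
    function_call={"name": "write_file"},  # force a structured reply
)

# The edit arrives as a JSON string of arguments rather than plain text,
# which the calling tool must parse and apply.
print(response.choices[0].message["function_call"]["arguments"])
```

The appeal is that a reply like this is machine readable by construction; the benchmark below measures whether that structure actually helps.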

With this in mind, I developed a
benchmark based on the [Exercism
python](https://github.com/exercism/python) coding exercises.
This
benchmark evaluates how effectively aider and GPT can translate a
natural language coding request into executable code saved into
files that pass unit tests.
It provides an end-to-end evaluation of not just
GPT's coding ability, but also its capacity to *edit existing code*
and *format those code edits* so that aider can save the
edits to the local source files.
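
As an illustration of what such an end-to-end check involves, the sketch below solves one exercise by handing its natural language instructions to the chat tool and then running the exercise's unit tests. The directory layout, file names, and command line flags are assumptions made for illustration, not the benchmark's actual harness.

```python
import subprocess
from pathlib import Path

def run_exercise(exercise_dir: Path) -> bool:
    """Solve one exercise end to end and report whether its unit tests pass."""
    # Hypothetical layout: each exercise ships instructions plus a stub and tests.
    instructions = (exercise_dir / "instructions.md").read_text()

    # Let the chat tool edit the stub files in place from the natural language request.
    subprocess.run(
        ["aider", "--yes", "--message", instructions],
        cwd=exercise_dir,
        check=False,
    )

    # The exercise counts as solved only if its tests pass after the edit.
    result = subprocess.run(["pytest", "-q"], cwd=exercise_dir)
    return result.returncode == 0

if __name__ == "__main__":
    dirs = [d for d in sorted(Path("exercises").iterdir()) if d.is_dir()]
    print(sum(run_exercise(d) for d in dirs), "of", len(dirs), "exercises passed")
```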

I ran the benchmark
on all the ChatGPT models (except `gpt-4-32k`), using a variety of edit formats.
The results were interesting:

- Asking GPT to return an updated copy of the whole file in a standard markdown fenced code block proved to be the most reliable and effective edit format across all GPT-3.5 and GPT-4 models. The results for this `whole` edit format are shown in solid blue in the graph.
- Using the new functions API for edits performed worse than the above whole file method, for all the models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results for these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
@@ -62,7 +55,7 @@
Or should they type up a properly escaped and
syntactically correct json data structure
that contains the text of the new code?

Using more complex output formats with GPT seems to cause two issues:

- It makes GPT write worse code. Keeping the output format simple seems to allow GPT to devote more attention to the actual coding task.
- It reduces GPT's adherence to the output format, making it more challenging for tools like aider to accurately identify and apply the edits GPT is attempting to make.
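
To make the escaping question above concrete, here is a small illustration (not taken from the benchmark) of the same two-line function written as plain source text versus as the string GPT must produce inside a JSON argument.

```python
import json

code = 'def greet(name):\n    return f"Hello, {name}!"\n'

# As plain text in a fenced code block, GPT just writes the code.
print(code)

# Inside a JSON argument, every newline and quote must be escaped correctly:
# "def greet(name):\n    return f\"Hello, {name}!\"\n"
print(json.dumps(code))
```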