Paul Gauthier 2023-11-08 08:36:48 -08:00
parent 2c83287c46
commit 59fed25cd1
4 changed files with 72 additions and 69 deletions


@@ -45,22 +45,20 @@ This is the edit format that aider uses by default with gpt-4.
- The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models. I won't be able to properly quantify this until the rate limits loosen up.
- **It seems better at producing correct code on the first try**. It gets
53% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
- The new model seems to perform similarly
(~62%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.
**These are preliminary results.**
OpenAI is enforcing very low
rate limits on the new GPT-4 model.
The rate limiting disrupts the benchmarking process,
requiring it to be paused and restarted frequently.
It took ~20 partial runs over ~2 days to complete all 133 Exercism problems.
The benchmarking harness is designed to stop/restart in this manner,
but results from a single "clean" run would be more trustworthy.
Once the rate limits are relaxed I will do a clean
run of the entire benchmark.
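As a rough sketch of that stop/restart pattern (an illustration, not the actual aider benchmark harness), each exercise can record its result on disk and be skipped on later partial runs:

```python
import json
from pathlib import Path


def run_benchmark(exercises: list[str], results_dir: Path) -> None:
    """Run each exercise once, skipping any finished in an earlier partial run."""
    for name in exercises:
        marker = results_dir / name / "results.json"
        if marker.exists():
            continue  # completed before a rate-limit pause
        passed = run_exercise(name)  # hypothetical stand-in for driving the model
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.write_text(json.dumps({"exercise": name, "passed": passed}))


def run_exercise(name: str) -> bool:
    """Stand-in for actually running the model against one exercise."""
    raise NotImplementedError
```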
### gpt-3.5-turbo-1106


@@ -69,7 +69,7 @@ More details on the benchmark, edit formats and results are discussed below.
## The benchmark
The benchmark uses
[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
These
exercises were designed to help individuals learn Python and hone
@@ -199,7 +199,7 @@ demo.py
### whole-func
The [whole-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/wholefile_func_coder.py)
format requests updated copies of whole files to be returned using the function call API.
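To make that concrete, here is a hedged sketch of the kind of function definition such a format could pass to the OpenAI function call API. The function and field names below (`write_file`, `files`, `path`, `content`) are illustrative assumptions, not necessarily aider's exact schema.

```python
# Sketch of a "return whole updated files" function schema, expressed as a
# Python dict suitable for the `functions` parameter of the chat completions API.
# The names are illustrative assumptions, not aider's actual schema.
write_file_function = {
    "name": "write_file",
    "description": "Return the complete, updated contents of the edited files.",
    "parameters": {
        "type": "object",
        "properties": {
            "explanation": {
                "type": "string",
                "description": "Brief reasoning for the changes.",
            },
            "files": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"},
                        "content": {
                            "type": "string",
                            "description": "The full, updated file contents.",
                        },
                    },
                    "required": ["path", "content"],
                },
            },
        },
        "required": ["files"],
    },
}
```

Constraining the reply to structured arguments like this is what makes the format easy to parse, at the cost of asking the model to re-emit entire files.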
@@ -218,7 +218,7 @@ format requests updated copies of whole files to be returned using the function
The
[diff-func](https://github.com/paul-gauthier/aider/blob/main/aider/coders/editblock_func_coder.py)
format requests a list of
original/updated style edits to be returned using the function call API.
```
@@ -235,7 +235,7 @@ original/updated style edits to be returned using the function call API.
],
}
]
}
```
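Only the tail of that schema survives in the hunk above, so as a rough illustration here is a hedged sketch of how the function-call arguments for an original/updated edit format could be parsed and applied. The argument names (`edits`, `original_lines`, `updated_lines`) are assumptions for illustration, not aider's exact field names.

```python
import json


def apply_func_edits(function_call_arguments: str, files: dict[str, str]) -> dict[str, str]:
    """Apply original/updated style edits returned via the function call API.

    `files` maps path -> current file contents. Field names here are
    illustrative assumptions, not aider's exact schema.
    """
    args = json.loads(function_call_arguments)
    for edit in args.get("edits", []):
        path = edit["path"]
        original = edit["original_lines"]
        updated = edit["updated_lines"]
        if original not in files.get(path, ""):
            raise ValueError(f"original lines not found in {path}")
        # Swap only the first occurrence, mirroring a search/replace style edit.
        files[path] = files[path].replace(original, updated, 1)
    return files
```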
## GPT-3.5's performance
@@ -307,7 +307,7 @@ The benchmark harness also logs SHA hashes of
all the OpenAI API requests and replies.
This makes it possible to
detect randomness or nondeterminism
in the benchmarking process.
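As a rough illustration of that kind of logging (a minimal sketch, not the actual harness code), each request/response pair can be reduced to a pair of hashes that is cheap to compare across runs:

```python
import hashlib
import json


def log_request_hashes(request: dict, response: dict, logfile: str = "api_hashes.log") -> None:
    """Append SHA-1 hashes of an API request and its reply to a log file."""
    # Serializing with sorted keys means logically identical payloads
    # always produce the same hash.
    req_hash = hashlib.sha1(json.dumps(request, sort_keys=True).encode()).hexdigest()
    resp_hash = hashlib.sha1(json.dumps(response, sort_keys=True).encode()).hexdigest()
    with open(logfile, "a") as f:
        f.write(f"{req_hash} {resp_hash}\n")
```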
It turns out that the OpenAI chat APIs are not deterministic, even at
`temperature=0`. The same identical request will produce multiple