Mirror of https://github.com/Aider-AI/aider.git, synced 2025-05-28 16:25:00 +00:00
Updated benchmark reports
commit 1d0bc3dcb6 (parent a15ac7ebb6)
5 changed files with 796 additions and 875 deletions
(Image diff suppressed because it is too large. Before: 55 KiB, After: 54 KiB)
(Image diff suppressed because it is too large. Before: 48 KiB, After: 47 KiB)
@@ -77,8 +77,8 @@ def show_stats(dirnames, graphs):
         elif row.model.startswith(gpt4):
             row.model = gpt4 + "\n" + row.model[len(gpt4) :]
 
-        if row.model == "gpt-4\n-1106-preview":
-            row.model += "\n(preliminary)"
+        # if row.model == "gpt-4\n-1106-preview":
+        #     row.model += "\n(preliminary)"
 
         if row.completed_tests < 133:
             print(f"Warning: {row.dir_name} is incomplete: {row.completed_tests}")
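The hunk above builds the x-axis labels for the benchmark plots: it wraps a GPT-4 model's version suffix onto a second line, and as of this commit it merely comments out the "(preliminary)" tag, presumably because the results are no longer preliminary. A minimal standalone sketch of the same label munging, using a hypothetical list of model names rather than aider's real result rows:

```python
# Standalone sketch of the label munging in show_stats above.
# The model list is hypothetical; the real code operates on benchmark result rows.
gpt4 = "gpt-4"

for model in ["gpt-4-0314", "gpt-4-0613", "gpt-4-1106-preview"]:
    if model.startswith(gpt4):
        # Wrap the version suffix onto a second line so plot labels stay narrow,
        # e.g. "gpt-4-1106-preview" -> "gpt-4\n-1106-preview".
        model = gpt4 + "\n" + model[len(gpt4):]
    print(repr(model))
```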
@@ -2,6 +2,8 @@
 
 [![benchmark results](https://aider.chat/assets/benchmarks-1106.svg)](https://aider.chat/assets/benchmarks-1106.svg)
 
+[![benchmark results](https://aider.chat/assets/benchmarks-speed-1106.svg)](https://aider.chat/assets/benchmarks-speed-1106.svg)
+
 [OpenAI just released new versions of GPT-3.5 and GPT-4](https://openai.com/blog/new-models-and-developer-products-announced-at-devday),
 and there's a lot
 of interest in their ability to code compared to the previous versions.
@@ -44,22 +46,11 @@ The benchmark gives aider two tries to complete the task:
 For now, I have only benchmarked the GPT-4 models using the `diff` edit method.
 This is the edit format that aider uses by default with gpt-4.
 
-- The new `gpt-4-1106-preview` model seems **much faster** than the earlier GPT-4 models. I won't be able to properly quantify this until the rate limits loosen up.
+- The new `gpt-4-1106-preview` model seems **2-2.5X faster** than the June GPT-4 model.
 - **It seems better at producing correct code on the first try**. It gets
 53% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
 - The new model seems to perform similarly
-(~62%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.
-
-**These are preliminary results.**
-OpenAI is enforcing very low
-rate limits on the new GPT-4 model.
-The rate limiting disrupts the benchmarking process,
-requiring it to be paused and restarted frequently.
-It took ~20 partial runs over ~2 days to complete all 133 Exercism problems.
-The benchmarking harness is designed to stop/restart in this manner,
-but results from a single "clean" run would be more trustworthy.
-Once the rate limits are relaxed I will do a clean
-run of the entire benchmark.
+(~65%) to the old models (63-64%) after their second chance to correct bugs by reviewing test suite error output.
 
 ### gpt-3.5-turbo-1106
 
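To make the first-try vs. second-try percentages in the hunk above concrete, here is a hedged sketch of how pass rates over the 133 Exercism exercises could be tallied. The `results` records are hypothetical, not the harness's actual output format:

```python
# Hypothetical per-exercise records; the real harness stores richer results.
results = [
    {"exercise": "acronym", "first_try": True,  "second_try": True},
    {"exercise": "bowling", "first_try": False, "second_try": True},
    {"exercise": "zipper",  "first_try": False, "second_try": False},
]

total = len(results)
first = sum(r["first_try"] for r in results)
# An exercise passes "after two tries" if either attempt succeeded.
second = sum(r["first_try"] or r["second_try"] for r in results)

print(f"first try: {100 * first / total:.0f}% of {total} exercises")
print(f"two tries: {100 * second / total:.0f}% of {total} exercises")
```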
@@ -37,17 +37,5 @@ generate responses which primarily consist of source code.
 Some observations:
 
 - **GPT-3.5 got 6-11x faster.** The `gpt-3.5-turbo-1106` model is 6-11x faster than the June (0613) version which has been the default `gpt-3.5-turbo` model.
-- **GPT-4 Turbo is 4-5x faster.** The new `gpt-4-1106-preview` model is 4-5x faster than the June (0613) version which has been the default `gpt-4` model.
+- **GPT-4 Turbo is 2-2.5x faster.** The new `gpt-4-1106-preview` model is 2-2.5x faster than the June (0613) version which has been the default `gpt-4` model.
 - The old March (0301) version of GPT-3.5 is actually faster than the June (0613) version. This was a surprising discovery.
-
-**These are preliminary results.**
-OpenAI is enforcing very low
-rate limits on the new GPT-4 model.
-The rate limiting disrupts the benchmarking process,
-requiring it to run single threaded, pause and restart frequently.
-These anomalous conditions make it slow to
-benchmark the new model, and make
-it less reliable to compare the results with
-benchmark runs against the older model versions.
-Once the rate limits are relaxed I will do a clean
-run of the entire benchmark suite.
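The paragraph removed above describes a harness that runs single threaded and pauses/restarts around rate limits. As an illustration only (not aider's actual retry logic; `call_model` is a hypothetical stand-in for the API request), such a harness might wrap each call in exponential backoff:

```python
import time

def call_with_backoff(call_model, max_tries=5):
    # Retry a rate-limited call with exponential backoff.
    # Sketch only: a real harness would catch the provider's specific
    # rate-limit exception and persist progress so interrupted runs can resume.
    delay = 1.0
    for attempt in range(max_tries):
        try:
            return call_model()
        except Exception:
            if attempt == max_tries - 1:
                raise
            time.sleep(delay)
            delay *= 2
```

Doubling the delay keeps retries cheap when the limit clears quickly, while backing off hard during sustained throttling.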