Mirror of https://github.com/Aider-AI/aider.git
commit c793511957 (parent cd6a278684)
1 changed file with 6 additions and 6 deletions
@@ -217,20 +217,21 @@ usually on the order of 3-6 different variations. This feels
 like they are load balancing across a number of slightly different
 instances of the model.
 
-For some exercises, some of this variable responses pass the unit tests and other
-responses do not.
+For some exercises, some of these variable responses pass the unit tests while
+other variants do not. Whether the exercises passes is therefore
+a bit random, depending on which variant OpenAI returns.
 
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
 This would average away the effect of the API variance.
-That would also significantly increase the cost of this sort of benchmarking,
+It would also significantly increase the cost of this sort of benchmarking,
 so I didn't do that.
 
 Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the `gpt-3.5-turbo-0613 + whole` experiment 10 times.
+I ran the `gpt-3.5-turbo-0613 / whole` experiment 10 times.
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.
 
@@ -239,7 +240,7 @@ contribute to a large variance in the benchmark results.
 
 ## Conclusions
 
-Based on these benchmarking results, aider will continue to usea
+Based on these benchmarking results, aider will continue to use
 `whole` for gpt-3.5 and `diff` for gpt-4.
 While `gpt-4` gets slightly better results with the `whole` edit format,
 it significantly increases costs and latency compared to `diff`.
@@ -247,4 +248,3 @@ Since `gpt-4` is already costly and slow, this seems like an acceptable
 tradeoff.
 
 
-
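The first hunk above describes the mitigation for API variance: run the same model/edit-format combination repeatedly, report an average, and show error bars for the spread across the 10 runs of `gpt-3.5-turbo-0613 / whole`. The following is a minimal sketch of that aggregation, not aider's actual benchmark harness, and the pass rates are made up for illustration:

```python
# Sketch of reducing repeated benchmark runs to a mean plus a min-max range
# (the range is what the error bars in the graph represent).
from statistics import mean

# Hypothetical percent-of-exercises-passed for 10 repeated runs of
# gpt-3.5-turbo-0613 with the "whole" edit format (illustrative numbers only).
runs = [57.1, 55.6, 58.6, 54.9, 57.9, 56.4, 58.1, 55.3, 57.4, 56.8]

avg = mean(runs)                   # headline number to report
low, high = min(runs), max(runs)   # spread shown as error bars

print(f"mean pass rate: {avg:.1f}%")
print(f"error bar range: {low:.1f}% - {high:.1f}%")
```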