Mirror of https://github.com/Aider-AI/aider.git
Synced 2025-05-20 12:24:59 +00:00
Commit 2ce5576aa9 (parent 2ee31fd251)
1 changed file with 4 additions and 4 deletions
@@ -215,16 +215,16 @@ usually on the order of 3-6 different variations. This feels
 like they are load balancing across a number of different
 instances of the model.
 
-For some exercises, some responses pass the unit tests and other
-responses don't.
+For some exercises, some of these variable responses pass the unit tests and other
+responses do not.
 
 Given that, it would be ideal to run all 133 exercises many times for each
-model + edit format combination and report an average performance.
+model/edit-format combination and report an average performance.
 This would average away the effect of the API variance.
 That would also significantly increase the cost of this sort of benchmarking,
 so I didn't do that.
 
-Running 133 test cases provides some robustness all by itself, since
+Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 
 But to get a sense of how much the API variance impacts the benchmark outcomes,
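The averaging the changed text describes could be sketched as below. This is an illustrative example only, not aider's actual benchmark harness: the function name, the boolean pass/fail representation, and the sample data are all assumptions made for the sketch.

```python
# Hypothetical sketch: run each exercise multiple times and report the
# mean pass rate across runs, smoothing out nondeterministic API responses.
from statistics import mean

def average_pass_rate(results):
    """results: list of runs, each run a list of per-exercise pass booleans."""
    per_run = [sum(run) / len(run) for run in results]  # pass rate per run
    return mean(per_run)  # average over repeated runs

# Three simulated benchmark runs over the same five exercises:
runs = [
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, True, True, False],
]
print(average_pass_rate(runs))  # mean of 0.6, 0.6, 0.8 -> ~0.667
```

In practice each run would cover all 133 exercises per model/edit-format combination, which is why repeating runs multiplies the benchmarking cost.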