mirror of https://github.com/Aider-AI/aider.git, synced 2025-05-28 16:25:00 +00:00
commit ae9df00043 (parent 108328b4be)
1 changed file with 9 additions and 7 deletions
@@ -56,8 +56,8 @@ More details on the benchmark, edit formats and results are discussed below.
 ## The benchmark
 
-The benchmark uses the 133
-[practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
+The benchmark uses
+[133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
 They were designed for people to learn and practice
 their python coding skills.
 
 
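Since this hunk only names the exercise set, here is a hedged sketch of what a minimal harness over those practice exercises could look like. It is not aider's actual benchmark code; the directory layout and the plain `pytest` invocation are assumptions for illustration.

```python
# Hypothetical sketch: run each Exercism practice exercise's unit tests
# and tally how many pass. NOT aider's benchmark harness; the directory
# layout and pytest invocation are assumptions.
import subprocess
from pathlib import Path

EXERCISES_DIR = Path("exercism/python/exercises/practice")  # assumed layout

def exercise_passes(exercise_dir: Path) -> bool:
    """Return True if every unit test in the exercise directory passes."""
    result = subprocess.run(
        ["pytest", "-q", str(exercise_dir)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def main() -> None:
    exercises = sorted(p for p in EXERCISES_DIR.iterdir() if p.is_dir())
    passed = sum(exercise_passes(d) for d in exercises)
    print(f"{passed}/{len(exercises)} exercises passed")

if __name__ == "__main__":
    main()
```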
@@ -186,7 +186,7 @@ format requests original/updated edits to be returned using the function call API
 ## GPT-3.5 hallucinates function calls?
 
 GPT-3.5 was very prone to ignoring the JSON Schema that specified valid functions,
-and would often return a completely invalid `function_call` fragment with `name="python"`.
+and would often return a completely invalid `function_call` fragment with `"name": "python"`.
 
 ```
   "function_call": {
@@ -200,7 +200,7 @@ with the arguments to the function specified in the `name` field.
 Instead, gpt-3.5 frequently just stuffed the entire python
 program into that field.
 
-It feels like it is getting confused with training done for ChatGPT plugins?
+It feels like it might be getting confused by fine tuning that was done for ChatGPT plugins?
 
 ## Limitations
 
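For context on the function-call shape the two hunks above refer to: with the 0613 models, a well-formed response names one of the functions declared in the request and carries its JSON-encoded arguments, whereas the failure mode described here answers with `"name": "python"` instead of any declared function. The sketch below contrasts the two; the function name, arguments, and exact placement of the raw program text are illustrative assumptions, not captured benchmark output.

```python
# Illustrative sketch only -- the payload contents are invented.

# A schema-conforming assistant message: `name` matches a function declared
# in the request, and `arguments` is a JSON string matching that function's
# parameter schema. (The write_file function here is hypothetical.)
valid_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "write_file",
        "arguments": '{"path": "hello.py", "content": "print(\'hi\')\\n"}',
    },
}

# The failure mode described above: `name` is "python" rather than any
# declared function, and raw program text is stuffed into the call fragment
# instead of schema-conforming JSON arguments. The exact field the program
# lands in is an illustrative guess.
hallucinated_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "python",
        "arguments": "def greet(name):\n    return f'Hello, {name}!'\n",
    },
}
```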
@@ -211,8 +211,9 @@ like they are load balancing across a number of slightly different
 instances of the model.
 
 For some exercises, some of these variable responses pass the unit tests while
-other variants do not. Whether the exercises passes is therefore
-a bit random, depending on which variant OpenAI returns.
+other variants do not. Results for exercises like this which are
+"on the bubble"
+are therefore a bit random, depending on which variant OpenAI returns.
 
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
@@ -224,7 +225,8 @@ Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
 
 But to get a sense of how much the API variance impacts the benchmark outcomes,
-I ran the `gpt-3.5-turbo-0613 / whole` experiment 10 times.
+I ran all 133 exercises 10 times each
+against `gpt-3.5-turbo-0613` with the `whole` edit format.
 You'll see one set of error bars in the graph, which demark
 the range of results across those 10 runs.
 
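To make the error-bar construction in this hunk concrete, here is a hedged sketch of one way to summarize repeated runs: compute each run's pass rate and report the mean together with the min-max range. The diff does not show aider's actual aggregation code, so the choice of range as the error-bar bounds and the synthetic data below are assumptions.

```python
# Hypothetical aggregation of repeated benchmark runs. The synthetic data
# and the min/max error-bar bounds are assumptions, not aider's reporting code.
import random

def pass_rate(results: list[bool]) -> float:
    """Fraction of exercises that passed in one run."""
    return sum(results) / len(results)

def summarize(runs: list[list[bool]]) -> tuple[float, float, float]:
    """Return (mean, low, high) pass rate across repeated runs."""
    rates = [pass_rate(run) for run in runs]
    return sum(rates) / len(rates), min(rates), max(rates)

# Usage with synthetic data: 10 runs over 133 exercises each.
random.seed(0)
runs = [[random.random() < 0.5 for _ in range(133)] for _ in range(10)]
mean, low, high = summarize(runs)
print(f"mean={mean:.1%}, error bar spans [{low:.1%}, {high:.1%}]")
```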