mirror of https://github.com/Aider-AI/aider.git (synced 2025-06-01 18:25:00 +00:00)
commit 56dc5f82f6 (parent bc7610cdda): 1 changed file with 20 additions and 10 deletions

@@ -51,12 +51,12 @@ You want to minimize the "cognitive overhead" of formatting the response, so that
GPT can focus on the task at hand.
As an analogy, you wouldn't expect a good result if you asked a junior developer to
implement a new feature by hand typing the required code
changes as `diff -c` formatted edits.
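
To make the analogy concrete, here is roughly what a `diff -c` style context diff looks like for a tiny hypothetical edit (the file and timestamps are made up); note how much header and marker syntax surrounds two changed lines:

```
*** demo.py.orig	2023-07-06 10:00:00
--- demo.py	2023-07-06 10:05:00
***************
*** 1,2 ****
! def greeting():
!     print("hi")
--- 1,2 ----
! def greeting(name):
!     print(f"hi {name}")
```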

Using more complex output formats seems to cause two problems:

- It makes GPT write worse code. Keeping the output format simple seems to leave GPT with more attention to devote to the actual coding task.
- It makes GPT less likely to adhere to the output format. This makes it harder for tooling to correctly identify and apply the edits it is trying to make.

I had hoped that the new function calling API would enable more reliable use of
structured output formats, but it does not appear to be a panacea
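
As a reference point, here is a minimal sketch of requesting structured edits through the 2023-era function calling API with the `openai` Python client. The `apply_edit` schema is a hypothetical stand-in, not aider's actual format:

```python
import openai

# Hypothetical edit schema, for illustration only.
functions = [
    {
        "name": "apply_edit",
        "description": "Apply an edit to one source file",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "original": {"type": "string"},
                "updated": {"type": "string"},
            },
            "required": ["path", "original", "updated"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Rename greeting() to hello() in demo.py"}],
    functions=functions,
    function_call={"name": "apply_edit"},  # force a structured reply
)
```

Even when the schema is enforced, the *contents* of the returned strings still have to be right, so structure alone does not guarantee usable edits.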

@@ -121,17 +121,17 @@ Just the error output from failed tests.

In summary, passing an exercise means GPT was able to:

- understand the instructions,
- write the required code (possibly after reviewing test error output; see the sketch below),
- correctly package up all of this code into the edit format so that aider can process and save it to the implementation file.
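
For the second step, a harness might capture the failing-test output along these lines (an illustrative sketch, not aider's actual benchmark code):

```python
import subprocess

# Run the exercise's unit tests and capture their output.
result = subprocess.run(
    ["python", "-m", "unittest", "discover"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    # unittest writes its results, including failures, to stderr;
    # this is the error text that gets sent back to GPT.
    error_output = result.stderr
```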

Conversely, failing an exercise only requires a breakdown in one of those steps.
In practice, GPT fails at different steps in different exercises.
Sometimes it just writes the wrong code.
Other times,
it fails to format the code edits in a way that conforms to the edit format so the code isn't saved properly.

It's worth keeping in mind that changing the edit format often affects both aspects of GPT's performance on the exercises.
Complex edit formats often make it write worse code *and* make it less successful at formatting the edits correctly.

## Edit formats

@@ -170,6 +170,13 @@ Each edit is a fenced code block that
specifies the filename and a chunk of ORIGINAL and UPDATED code.
GPT provides some original lines from the file and then a new updated set of lines.

While GPT-3.5 is sometimes able to generate this `diff` edit format,
it often uses it in a pathological way.
It puts the *entire* original source file in the ORIGINAL block
and the entire updated file in the UPDATED block.
This is strictly worse than just using the `whole` edit format,
since GPT is sending 2 full copies of the file.

````
Here are the changes you requested to demo.py:
````

@@ -246,7 +253,7 @@ It feels like it might be getting confused by fine tuning that was done for Chat

## Randomness

The benchmark attempts to be deterministic, always sending identical
requests for each exercise on repeated runs.
As part of this effort,
when sending test error output to GPT
@@ -254,14 +261,17 @@ it removes the wall-clock timing information that
is normally included by the `unittest` module.
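
Stripping that timing can be as simple as a regex pass over the test output; a minimal sketch, assuming the standard `Ran N tests in X.XXXs` summary line that `unittest` prints:

```python
import re

def strip_timing(test_output: str) -> str:
    # Replace the variable wall-clock time, e.g. "Ran 5 tests in 0.003s",
    # with a fixed placeholder so repeated runs stay byte-identical.
    return re.sub(r"in \d+\.\d+s", "in <time>", test_output)
```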

The benchmarking harness also logs sha hashes of
all the OpenAI API requests and replies.
This makes it possible to
detect randomness or nondeterminism
in the benchmarking process.
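
Such hashing might look like the following sketch (illustrative, not aider's actual logging code); serializing with sorted keys makes the fingerprint independent of dict ordering, so identical requests always hash identically:

```python
import hashlib
import json

def fingerprint(payload: dict) -> str:
    # Canonical JSON: sorted keys, no extraneous whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

request = {
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "messages": [{"role": "user", "content": "hi"}],
}
# Byte-identical requests produce identical hashes; any drift in the
# requests or replies shows up as a new hash in the logs.
print(fingerprint(request))
```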

It turns out that the OpenAI chat APIs are not deterministic, even at `temperature=0`.
The same identical request will produce multiple distinct responses,
usually on the order of 3-6 different variations. This feels
like OpenAI may be
load balancing their API
across a number of slightly different
instances of the model.
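
This is easy to check with a small probe along these lines (hypothetical, not part of the benchmark harness), which resends one fixed request and counts the distinct completions:

```python
import openai

def count_variations(messages, n=10):
    seen = set()
    for _ in range(n):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0,
        )
        seen.add(resp["choices"][0]["message"]["content"])
    return len(seen)

# Even at temperature=0 this often reports more than one distinct reply.
print(count_variations([{"role": "user", "content": "Write one line of Python."}]))
```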

For some exercises, some of these variable responses pass the unit tests while