Mirror of https://github.com/Aider-AI/aider.git (synced 2025-06-06 04:35:00 +00:00)

Commit ea15e5c60c ("copy"), parent 8a0909738d
1 changed file with 22 additions and 19 deletions
@@ -3,13 +3,15 @@
 
 
 
-Aider is an open source command line chat tool that lets you ask GPT for features, changes and
-improvements to code in your local git repos.
+Aider is an open source command line chat tool that lets you ask GPT to edit
+code in your local git repos.
+You can use aider to ask GPT to add features, write tests or make other changes and
+improvements to your code.
 
-Having a reliable way for GPT to read/modify/write source files is critical to
-using GPT to edit code within an existing codebase.
+Having a reliable way for GPT to read/modify/write
+local source code files is a critical component of this functionality.
 Making code editing more reliable often
-involves tweaking and experimenting with
+involves changing and experimenting with
 the "edit format" that aider uses.
 The edit format is a key part of the system prompt,
 specifying how GPT should format code edits in its replies.
@@ -30,11 +32,11 @@ actual runnable code saved into files that pass unit tests.
 This is an end-to-end assessment
 of not just how well GPT can write code, but also how well it
 can *edit existing code* and
-*package up these code changes*
+*package up those code changes*
 so that aider can save the edits to the
 local source files.
 
-I ran the code editing benchmark
+I ran this code editing benchmark
 on almost all the ChatGPT models, using a variety of edit formats.
 This produced some interesting results:
 
@@ -47,8 +49,9 @@ The quantitative benchmark results agree with an intuition that I've been
 developing about how to prompt GPT for complex tasks like coding.
 You want to minimize the "cognitive overhead" of formatting the response, so that
 GPT can focus on the task at hand.
-For example, you wouldn't expect a good result if you asked a junior developer to
-implement a new feature by hand typing `diff -c` formatted updates to the current code.
+As an analogy, you wouldn't expect a good result if you asked a junior developer to
+implement a new feature by hand typing the required code
+changes as `diff -c` formatted updates.
 
 Using more complex output formats seem to cause two problems:
 
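To make the analogy in the hunk above concrete: context-format (`diff -c`) output looks like the snippet below. Python's `difflib.context_diff` emits the same format, so a few lines are enough to show it; the `demo.py` one-line change is invented purely for illustration.

```python
import difflib

# A tiny invented edit to demo.py, shown in context ("diff -c") format.
# Hand-typing edits in this format is the kind of cognitive overhead the post describes.
before = ['def greeting(name):\n', '    return "Hello"\n']
after = ['def greeting(name):\n', '    return f"Hello {name}"\n']

print("".join(difflib.context_diff(before, after, fromfile="demo.py", tofile="demo.py")))
```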
@@ -66,7 +69,7 @@ More details on the benchmark, edit formats and results are discussed below.
 
 The benchmark uses
 [133 practice exercises from the Exercism python repository](https://github.com/exercism/python/tree/main/exercises/practice).
-They were designed for people to learn python and practice
+These exercises were designed for people to learn python and practice
 their coding skills.
 
 Each exercise has:
@@ -77,7 +80,7 @@ Each exercise has:
 
 The goal is for GPT to read the instructions, implement the provided functions/class skeletons
 and pass all the unit tests. The benchmark measures what percentage of
-the 133 exercises are completed successfully, which means all the associated unit tests passed.
+the 133 exercises are completed successfully, causing all the associated unit tests to pass.
 
 To complete an exercise, aider sends GPT
 the initial contents of the implementation file,
@@ -92,10 +95,10 @@ Only use standard python libraries, don't suggest installing any packages.
 
 Aider updates the implementation file based on GPT's reply and runs the unit tests.
 If they all pass, we are done. If some tests fail, aider sends
-a second message with the test error output.
+GPT a second message with the test error output.
 It only sends the first 50 lines of test errors, to avoid exhausting the context
 window of the smaller models.
-It also includes this final instruction:
+Aider also includes this final instruction:
 
 ```
 See the testing errors above.
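The hunk above describes the benchmark's test-and-retry protocol. A minimal sketch of that loop is below; the helper functions are placeholders, not aider's real internals, and the use of pytest is an assumption made only to keep the example runnable.

```python
"""Minimal sketch of the two-attempt benchmark loop described above.

The helpers below are placeholders, not aider's real internals.
"""
import subprocess
from pathlib import Path


def send_to_gpt(prompt: str) -> str:
    """Placeholder for a chat-completion call; returns the model's reply text."""
    raise NotImplementedError


def apply_edits(impl_file: Path, reply: str) -> None:
    """Placeholder: parse the reply's edit format and rewrite the implementation file."""
    raise NotImplementedError


def run_unit_tests(test_file: Path) -> str:
    """Run the exercise's tests; return "" on success, else the captured error output."""
    result = subprocess.run(
        ["python", "-m", "pytest", str(test_file)],
        capture_output=True, text=True,
    )
    return "" if result.returncode == 0 else result.stdout + result.stderr


def attempt_exercise(instructions: str, impl_file: Path, test_file: Path) -> bool:
    reply = send_to_gpt(instructions)      # first try: instructions + starting file contents
    apply_edits(impl_file, reply)
    errors = run_unit_tests(test_file)
    if not errors:
        return True

    # Second (and final) try: only the first 50 lines of test output are sent back,
    # so the error report fits in the context window of the smaller models.
    truncated = "\n".join(errors.splitlines()[:50])
    reply = send_to_gpt(truncated + "\n\nSee the testing errors above.")
    apply_edits(impl_file, reply)
    return run_unit_tests(test_file) == ""
```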
@@ -134,7 +137,7 @@ format asks GPT to return an updated copy of the entire file, including any chan
 The file should be
 formatted with normal markdown triple-backtick fences, inlined with the rest of its response text.
 
-This format is very similar to how ChatGPT returns code snippets during normal chats, except with the addition of a filename right before the opening triple backticks.
+This format is very similar to how ChatGPT returns code snippets during normal chats, except with the addition of a filename right before the opening triple-backticks.
 
 ````
 Here is the updated copy of your file demo.py:
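The `whole` format in the hunk above (a filename on the line just before a fenced copy of the entire file) is easy to recognize mechanically. Here is a rough, illustrative parser for it; the regex and the `extract_whole_files` helper are assumptions for this sketch, not aider's actual implementation.

````python
import re

# Illustrative parser for the `whole` edit format: a filename right before a
# triple-backtick fence, followed by the complete updated file.
FENCED_FILE = re.compile(
    r"^(?P<fname>\S+\.\w+):?[ \t]*\n"   # e.g. "demo.py" alone on the preceding line
    r"```[^\n]*\n"                      # opening fence, optionally with a language tag
    r"(?P<body>.*?)\n"                  # the whole updated file
    r"```",                             # closing fence
    re.DOTALL | re.MULTILINE,
)


def extract_whole_files(reply: str) -> dict[str, str]:
    """Map each filename named before a fence to the fenced file contents."""
    return {m.group("fname"): m.group("body") for m in FENCED_FILE.finditer(reply)}


reply = """Here is the updated copy of your file demo.py:

demo.py
```python
def hello():
    print("hello")
```
"""
print(extract_whole_files(reply))  # {'demo.py': 'def hello():\n    print("hello")'}
````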
@@ -212,7 +215,7 @@ original/updated style edits to be returned using the function call API.
 ## GPT-3.5 hallucinates function calls
 
 GPT-3.5 is prone to ignoring the JSON Schema that specifies valid functions,
-and often returns a completely novel and syntactically
+and often returns a completely novel and semantically
 invalid `function_call` fragment with `"name": "python"`.
 
 ```
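As background for the hunk above: with the function call API, a well-formed reply names one of the functions declared in the request's JSON Schema, so a hallucinated `"name": "python"` call can be caught with a simple check. The sketch below is illustrative only; the declared function names and the fallback behavior are assumptions, not aider's code.

```python
import json

# Function names declared in the request's JSON Schema (illustrative names).
DECLARED_FUNCTIONS = {"write_file", "replace_lines"}


def parse_function_call(message: dict) -> dict | None:
    """Return parsed arguments for a schema-conforming function_call, else None."""
    call = message.get("function_call")
    if not call:
        return None
    if call.get("name") not in DECLARED_FUNCTIONS:
        # e.g. the hallucinated {"name": "python", ...} fragment described above
        return None
    try:
        return json.loads(call.get("arguments", "{}"))
    except json.JSONDecodeError:
        return None  # arguments that are not valid JSON count as a failed edit too
```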
@@ -257,8 +260,8 @@ are therefore a bit random, depending on which variant OpenAI returns.
 Given that, it would be ideal to run all 133 exercises many times for each
 model/edit-format combination and report an average performance.
 This would average away the effect of the API variance.
-It would also significantly increase the cost of this sort of benchmarking,
-so I didn't do that.
+It would also significantly increase the cost of this sort of benchmarking.
+So I didn't do that.
 
 Benchmarking against 133 exercises provides some robustness all by itself, since
 we are measuring the performance across many exercises.
@@ -266,8 +269,8 @@ we are measuring the performance across many exercises.
 But to get a sense of how much the API variance impacts the benchmark outcomes,
 I ran all 133 exercises 10 times each
 against `gpt-3.5-turbo-0613` with the `whole` edit format.
-You'll see one set of error bars in the graph, which demark
-the range of results across those 10 runs.
+You'll see one set of error bars in the graph, which show
+the range of results from those 10 runs.
 
 The OpenAI API randomness doesn't seem to
 cause a large variance in the benchmark results.
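Finally, the error bars described in the last hunk summarize repeated runs of the same model/edit-format combination. A sketch of how such bars could be computed is below; the per-run pass counts are placeholders, not the benchmark's actual numbers.

```python
# Turn repeated benchmark runs into a point estimate plus error bars.
# The pass counts are placeholders, not real results from the post.
TOTAL_EXERCISES = 133
pass_counts = [78, 81, 80, 79, 82, 80, 77, 81, 80, 79]  # exercises solved in each of 10 runs

rates = [count / TOTAL_EXERCISES for count in pass_counts]
mean = sum(rates) / len(rates)
low, high = min(rates), max(rates)  # the error bars mark the range across the runs

print(f"mean pass rate {mean:.1%}, range {low:.1%}-{high:.1%}")
```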