Mirror of https://github.com/Aider-AI/aider.git, synced 2025-05-29 16:54:59 +00:00

Commit b9f8ed47f4 (parent 8ef166478a), 3 changed files with 325 additions and 279 deletions.

One file's diff is suppressed because it is too large.
(Image file changed: 74 KiB before, 75 KiB after.)
@@ -106,8 +106,8 @@ def show_stats(dirnames):
     for i, fmt in enumerate(formats):
         if zorder > 1:
             edge = dict(
-                edgecolor="#eeeeee",
-                linewidth=2,
+                edgecolor="#ffffff",
+                linewidth=3,
             )
         else:
             edge = dict()
@@ -148,16 +148,16 @@ def show_stats(dirnames):
         arrowprops={"arrowstyle": "->", "connectionstyle": "arc3,rad=0.3"},
     )
     ax.annotate(
-        "Second attempt,\nafter seeing\nunittest errors",
+        "Second attempt,\nafter seeing\nunit test errors",
         xy=(3.1, 68),
         xytext=(4.25, 80),
         horizontalalignment="center",
         arrowprops={"arrowstyle": "->", "connectionstyle": "arc3,rad=0.3"},
     )

-    ax.set_ylabel("Percent of exercises with\nall unittests passing")
+    ax.set_ylabel("Percent of exercises with\nall unit tests passing")
     # ax.set_xlabel("Model")
-    ax.set_title("Code Editing Success")
+    ax.set_title("GPT Code Editing")
     ax.legend(
         title="Edit Format",
         loc="upper left",
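For context on the styling change above: the rest of show_stats is not shown in this diff, but here is a minimal sketch of how an edge dict and zorder like these typically feed into ax.bar. The series names and pass rates are invented for illustration, not benchmark data.

import matplotlib.pyplot as plt

formats = ["whole", "diff", "whole-func"]
models = ["gpt-3.5-0301", "gpt-3.5-0613", "gpt-4-0314"]
pass_rates = [[58, 52, 68], [54, 50, 62], [40, 44, 55]]  # invented numbers

fig, ax = plt.subplots()
for i, fmt in enumerate(formats):
    zorder = 2 if fmt == "whole" else 1
    if zorder > 1:
        # the highlighted series gets the white outline from the diff above
        edge = dict(edgecolor="#ffffff", linewidth=3)
    else:
        edge = dict()
    xs = [x + 0.25 * i for x in range(len(models))]
    ax.bar(xs, pass_rates[i], width=0.2, label=fmt, zorder=zorder, **edge)

ax.set_ylabel("Percent of exercises with\nall unit tests passing")
ax.set_title("GPT Code Editing")
ax.legend(title="Edit Format", loc="upper left")
plt.show()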
@@ -10,18 +10,18 @@ improvements to your code.

 The ability for GPT to reliably edit local source files is
 crucial for this functionality.
-Improving the reliability of code
-editing often involves modifying and experimenting with the "edit
-format" used by aider. The edit format is a critical component of the
-system prompt, dictating how GPT should structure code edits in its
+Much of this depends on the "edit format", which is an important component of the
+system prompt.
+The edit format specifies how GPT should structure code edits in its
 responses.

 Aider currently uses simple text based editing formats, but
 [OpenAI's new function calling
 API](https://openai.com/blog/function-calling-and-other-api-updates)
-looked like a promising way to construct a more structured editing format.
+look like a promising way to create more structured edit formats.
 Before making such a big change, I wanted to make
-sure I had a quantitative way to assess the impact on
+sure I had a quantitative way to assess
+how function based edit formats would affect
 the reliability of code editing.

 I developed a
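As a hedged illustration of the "simple text based editing formats" mentioned above (not aider's actual code): a whole-file style format just asks the model for the entire updated file in a fenced code block, which the client can extract with a regex and write over the local file.

import re

reply = """Here is the updated file:

```python
def greeting(name):
    return f"Hello, {name}!"
```
"""

# capture everything between the opening and closing fence
match = re.search(r"```[^\n]*\n(.*?)```", reply, re.DOTALL)
if match:
    updated_file = match.group(1)  # write this over the local file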
@@ -40,8 +40,8 @@ on almost all the ChatGPT models, using a variety of edit formats.
 The results were quite interesting:

 - Asking GPT to return an updated copy of the whole file in a standard markdown fenced code block proved to be the most reliable and effective edit format across all GPT-3.5 and GPT-4 models. The results from this `whole` edit format are shown in solid blue in the graph.
-- Using the new function calling API performed worse than the above whole file method for all models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results from these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
-- The performance of the June (`0613`) version of GPT-3.5 appears to be a bit worse than the Feb (`0301`) version. This is visible if you look at the "first coding attempt" markers on the blue bars.
+- Using the new functions API performed worse than the above whole file method for all models. GPT-3.5 especially produced inferior code and frequently mangled this output format. This was surprising, as the functions API was introduced to enhance the reliability of structured outputs. The results from these `...-func` edit methods are shown as patterned bars in the graph (both green and blue).
+- The performance of the new June (`0613`) version of GPT-3.5 appears to be a bit worse than the February (`0301`) version. This is visible if you look at the "first coding attempt" markers on the first three blue bars.
 - As expected, the GPT-4 models outperformed the GPT-3.5 models in code editing.

 The quantitative benchmark results align with my intuitions
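For readers unfamiliar with the functions API discussed in the bullets above, here is a hedged sketch of what a function based edit format might look like, using the pre-1.0 openai Python package that was current when the 0613 models shipped. The write_file schema is an illustrative guess, not aider's actual function definition.

import json
import openai

functions = [
    {
        "name": "write_file",  # hypothetical function, for illustration only
        "description": "Replace the entire contents of a source file.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Fix the bug in greeting.py"}],
    functions=functions,
    function_call={"name": "write_file"},  # force a structured edit
)

# the model returns its edit as a JSON arguments string
args = json.loads(response.choices[0].message.function_call.arguments)
# args["path"] and args["content"] hold the structured edit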
@@ -115,6 +115,14 @@ Many of the exercises have multiple paragraphs of instructions,
 and most human coders would likely fail some tests on their
 first try.

+The bars in the graph show the percent of exercises that were completed by
+each model and edit format combination. The full bar height represents
+the final outcome following the first coding attempt and the second
+attempt that includes the unit test error output.
+Each bar also has a horizontal mark that shows
+the intermediate performance after the first coding attempt,
+without the benefit of second try.
+
 It's worth noting that GPT never gets to see the source code of the
 unit tests during the benchmarking. It only sees the error output from
 failed tests. Of course, all of this code was probably part of its
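The paragraph added above describes full-height bars with an intermediate horizontal mark for the first attempt. A minimal sketch of that presentation follows, with invented numbers; matplotlib's "_" marker draws a horizontal tick.

import matplotlib.pyplot as plt

models = ["gpt-3.5-0301", "gpt-3.5-0613", "gpt-4-0314"]
second_attempt = [55, 52, 70]  # final outcome, full bar height (invented)
first_attempt = [40, 37, 55]   # intermediate mark (invented)

fig, ax = plt.subplots()
xs = range(len(models))
ax.bar(xs, second_attempt, width=0.6, color="#6495ED", zorder=2)
# horizontal tick at the first-attempt pass rate on each bar
ax.scatter(xs, first_attempt, marker="_", s=600, color="#333333", zorder=3)
ax.set_xticks(list(xs))
ax.set_xticklabels(models)
ax.set_ylabel("Percent of exercises with\nall unit tests passing")
plt.show()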