Commit graph

258 commits

Author SHA1 Message Date
Paul Gauthier
bf03f43b44 fixed benchmark 2023-12-05 11:21:11 -08:00
Paul Gauthier
6ebc142377 roughed in openai 1.x 2023-12-05 07:37:05 -08:00
Paul Gauthier
1d0bc3dcb6 Updated benchmark reports 2023-11-14 16:03:50 -08:00
Paul Gauthier
8f8a32f325 aider: Fix NameError in show_stats function by passing graphs as a parameter.
# Aider chat conversation:

USER: I ran this command:

 ./benchmark/benchmark.py tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 --stats

And got this output:

─────────────────────────────── tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 ───────────────────────────────
test-cases: 133
model: gpt-3.5-turbo-1106
edit_format: whole
commit_hash: f658a65-dirty
num_error_outputs: 2
num_user_asks: 2
num_exhausted_context_windows 0
test_timeouts: 2

41.4% correct after try 0
55.6% correct after try 1

duration: 3.1 sec/test-case
costs: $0.0037/test-case, $0.49 total, $0.49 projected
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 772, in <module>
    app()
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 328, in __call__
    raise e
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 716, in main
    return _main(
           ^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 353, in main
    return show_stats(updated_dirnames)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 124, in show_stats
    if graphs:
       ^^^^^^
NameError: name 'graphs' is not defined

ASSISTANT: It appears that the `graphs` variable is being used in the `show_stats` function without being defined or passed as an argument to the function. To fix this issue, we need to modify the `show_stats` function to accept `graphs` as a parameter and then pass the `graphs` variable when the function is called from the `main` function.

Here are the steps to fix the issue:

1. Modify the `show_stats` function definition to accept a new parameter `graphs`.
2. Update the call to `show_stats` in the `main` function to pass the `graphs` variable.

Let's make these changes in the `benchmark/benchmark.py` file.

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
def show_stats(dirnames):
=======
def show_stats(dirnames, graphs):
>>>>>>> REPLACE
</source>

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
return show_stats(updated_dirnames)
=======
return show_stats(updated_dirnames, graphs)
>>>>>>> REPLACE
</source>

Please apply these changes to the `benchmark/benchmark.py` file to resolve the `NameError`.
2023-11-08 14:15:08 -08:00
Paul Gauthier
9351f438ac aider: Add a cmd line switch --graphs and only draw them if it's provided.
# Aider chat conversation:

USER: add a cmd line switch --graphs and only draw them if its provided

ASSISTANT: To add a command-line switch `--graphs` and conditionally generate graphs based on its presence, we need to modify the `benchmark.py` file. Specifically, we will add a new option to the `main` function to accept the `--graphs` switch and then use this option to control the calls to `plot_timing` and `plot_outcomes` functions.

Here are the steps we will follow:

1. Add the `--graphs` option to the `typer` command-line interface in the `main` function.
2. Use the value of this option to conditionally call the `plot_timing` and `plot_outcomes` functions.

Let's make these changes:

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
=======
@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
    graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"),
>>>>>>> REPLACE

...

<<<<<<< SEARCH
    plot_timing(df)
    plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg)
=======
    if graphs:
        plot_timing(df)
        plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg)
>>>>>>> REPLACE
</source>

With these changes, the `--graphs` switch will be available when running the `benchmark.py` script, and the graph generation will only occur if this switch is provided.
2023-11-08 14:13:59 -08:00
Paul Gauthier
ed0dc95a10 copy 2023-11-08 11:16:56 -08:00
Paul Gauthier
f2a3e489e1 labels 2023-11-08 10:26:22 -08:00
Paul Gauthier
02b604adc3 Add labels to the legend in the speed graph 2023-11-08 10:20:28 -08:00
Paul Gauthier
867a0cc33f aider: Add labels to the bars in the plot_timing function to display them correctly in the legend.
# Aider chat conversation:

ASSISTANT: To address the issue with the empty legend for the edit format in the `plot_timing` function, we need to add labels to the bars so that the legend can display them correctly. We will modify the `benchmark/benchmark.py` file to include labels for each edit format.

Here's the change that needs to be made:

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
    for i, fmt in enumerate(formats):
        edge = dict(edgecolor="#ffffff", linewidth=1.5)
        color = "#b3e6a8" if "diff" in fmt else "#b3d1e6"
        hatch = "////" if "func" in fmt else ""
        rects = ax.bar(
            pos + i * width,
            grouped[fmt],
            width * 0.95,
            color=color,
            hatch=hatch,
            zorder=zorder + 1,
            **edge,
        )
=======
    # Create a mapping for the labels and colors to ensure consistency
    format_labels = {
        'diff': 'Diff',
        'func': 'Function',
        'whole': 'Whole',
        'diff-func': 'Diff-Function',
        'diff-whole': 'Diff-Whole',
        'func-whole': 'Function-Whole',
        # Add other edit formats here if necessary
    }
    format_colors = {
        'diff': '#b3e6a8',
        'func': '#b3d1e6',
        'whole': '#e6b3b3',
        'diff-func': '#e6d1b3',
        'diff-whole': '#b3e6d1',
        'func-whole': '#d1b3e6',
        # Add other edit format colors here if necessary
    }
    format_hatches = {
        'diff': '////',
        'func': '....',
        'whole': '',
        'diff-func': '\\\\\\\\',
        'diff-whole': 'xxxx',
        'func-whole': '----',
        # Add other edit format hatches here if necessary
    }

    for i, fmt in enumerate(formats):
        label = format_labels.get(fmt, fmt)
        color = format_colors.get(fmt, '#b3d1e6')
        hatch = format_hatches.get(fmt, '')
        edge = dict(edgecolor="#ffffff", linewidth=1.5)
        rects = ax.bar(
            pos + i * width,
            grouped[fmt],
            width * 0.95,
            label=label,
            color=color,
            hatch=hatch,
            zorder=zorder + 1,
            **edge,
        )
>>>>>>> REPLACE
</source>

This change adds a mapping for labels, colors, and hatches for each edit format. It also ensures that each bar in the bar chart has a label associated with it, which will be used to generate the legend. The colors and hatches are placeholders and should be adjusted to match the desired styling.
2023-11-08 10:11:47 -08:00
Paul Gauthier
eed4864af1 Changed the label of the y-axis in the plot to "Average GPT response time per exercise (sec)". 2023-11-08 10:11:46 -08:00
Paul Gauthier
207bbbc4df aider: Implemented the plot_timing function to create a bar chart showing the average duration of each (model, edit_format) combination, with consistent styling as plot_outcomes.
# Aider chat conversation:

ASSISTANT: To implement the `plot_timing` function and style the graph similarly to `plot_outcomes`, we will follow these steps:

1. Define the `plot_timing` function to create a bar chart that shows the average duration for each combination of model and edit format.
2. Use similar styling options as used in `plot_outcomes` for consistency.

Let's start by implementing the `plot_timing` function in the `benchmark/benchmark.py` file.

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
def plot_timing(df):
    """plot a graph showing the average duration of each (model, edit_format)"""
=======
def plot_timing(df):
    """plot a graph showing the average duration of each (model, edit_format)"""
    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"

    from matplotlib import rc
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.grid(axis="y", zorder=0, lw=0.2)

    zorder = 1
    grouped = df.groupby(["model", "edit_format"])["avg_duration"].mean().unstack()
    num_models, num_formats = grouped.shape

    pos = np.array(range(num_models))
    width = 0.8 / num_formats

    formats = grouped.columns
    models = grouped.index

    for i, fmt in enumerate(formats):
        edge = dict(edgecolor="#ffffff", linewidth=1.5)
        color = "#b3e6a8" if "diff" in fmt else "#b3d1e6"
        hatch = "////" if "func" in fmt else ""
        rects = ax.bar(
            pos + i * width,
            grouped[fmt],
            width * 0.95,
            color=color,
            hatch=hatch,
            zorder=zorder + 1,
            **edge,
        )
        ax.bar_label(rects, padding=4, labels=[f"{v:.1f}s" for v in grouped[fmt]], size=6)

    ax.set_xticks([p + 0.5 * width for p in pos])
    ax.set_xticklabels(models)

    ax.set_ylabel("Average duration (seconds)")
    ax.set_title("GPT Code Editing - Timing")
    ax.legend(
        title="Edit Format",
        loc="upper left",
    )
    ax.set_ylim(top=max(grouped.max()) * 1.1)  # Set y-axis limit to 10% more than the max value

    plt.tight_layout()
    plt.savefig("tmp_timing.svg")
    imgcat(fig)
>>>>>>> REPLACE
</source>

This code block adds the implementation of the `plot_timing` function, which creates a bar chart to visualize the average duration of tests for each model and edit format. The styling is kept consistent with the `plot_outcomes` function, using similar colors, hatches, and font settings. The resulting graph is saved as an SVG file and displayed in the terminal using `imgcat`.
2023-11-08 10:07:29 -08:00
Paul Gauthier
ab55b6ff0a Copy ORIGINAL_DNAME to dirname. 2023-11-08 10:07:27 -08:00
Paul Gauthier
59fed25cd1 copy 2023-11-08 08:36:48 -08:00
Paul Gauthier
2c83287c46 refac 2023-11-08 06:21:16 -08:00
Paul Gauthier
5da64a6abc copy 2023-11-07 07:22:26 -08:00
Paul Gauthier
dfb921c509 copy 2023-11-06 20:24:44 -08:00
Paul Gauthier
15a1c754a4 copy 2023-11-06 18:58:24 -08:00
Paul Gauthier
35754247b4 copy 2023-11-06 18:48:14 -08:00
Paul Gauthier
976fc7a836 update benchmarking script 2023-11-06 18:26:02 -08:00
Joshua Vial
91b7648f33 update model creation method 2023-09-09 12:06:34 +12:00
Paul Gauthier
2c53c153b8 Updated ORIGINAL prompting 2023-08-09 11:38:39 -03:00
Paul Gauthier
0aa2ff2cf9 roughed in cache; elide full dirnames from test results to make them deterministic 2023-08-09 11:30:13 -03:00
Paul Gauthier
2eeee67ee5 support --keywords 2023-08-09 11:12:27 -03:00
Paul Gauthier
c43d7e9a69 initial --diffs implementation 2023-08-08 17:40:32 -03:00
Paul Gauthier
568be3812b Fix last_keyboard_interrupt in benchmark 2023-07-20 13:25:15 -03:00
Paul Gauthier
edbec5498e update benchmark driver 2023-07-13 20:02:58 -07:00
Paul Gauthier
44311f4737 preparing for v0.8.0 2023-07-06 14:03:06 -07:00
Paul Gauthier
ba2c5d05da new svg 2023-07-02 10:45:14 -07:00
Paul Gauthier
4ea70bfdc0 new svg 2023-07-02 10:35:11 -07:00
Paul Gauthier
5e4ad5bf00 new svg 2023-07-02 10:28:39 -07:00
Paul Gauthier
8f4a823855 new svg 2023-07-02 10:26:18 -07:00
Paul Gauthier
2915acb9a6 new svg 2023-07-02 10:23:06 -07:00
Paul Gauthier
86ced88f8e new svg 2023-07-02 10:18:58 -07:00
Paul Gauthier
db18876db6 plot the repeat error bars 2023-07-02 07:10:54 -07:00
Paul Gauthier
a4ab262d96 new svg 2023-07-01 21:35:38 -07:00
Paul Gauthier
f548f88220 new svg 2023-07-01 21:31:54 -07:00
Paul Gauthier
b9f8ed47f4 copy 2023-07-01 17:50:48 -07:00
Paul Gauthier
8ef166478a graph layout 2023-07-01 17:18:14 -07:00
Paul Gauthier
3f0c752368 show 1st & 2nd coding attempts 2023-07-01 16:40:25 -07:00
Paul Gauthier
f27a869649 new svg 2023-07-01 16:23:09 -07:00
Paul Gauthier
bddc5a36f0 cleanup 2023-07-01 13:29:52 -07:00
Paul Gauthier
34b4ab199d show try 1 and 2 2023-07-01 13:28:36 -07:00
Paul Gauthier
6471b89033 return 0 2023-07-01 06:50:13 -07:00
Paul Gauthier
92e5f43abe intro 2023-06-30 20:42:54 -07:00
Paul Gauthier
5942057a9c introduction.md 2023-06-30 20:39:08 -07:00
Paul Gauthier
895d3d2d67 skip graph 2023-06-30 20:36:12 -07:00
Paul Gauthier
5310ed6faf filename 2023-06-30 20:31:17 -07:00
Paul Gauthier
c745a3156f instructions_append 2023-06-30 20:30:00 -07:00
Paul Gauthier
8a0909738d stars 2023-06-30 19:27:57 -07:00
Paul Gauthier
674bfb52c6 updated graph with larger fonts 2023-06-30 19:19:23 -07:00