Commit graph

247 commits

Author SHA1 Message Date
Paul Gauthier
937ee478ad show with imgcat 2023-06-29 19:04:48 -07:00
Paul Gauthier
0f6a90bfed aider: Switched the plt stuff to use an explicit fig.
# Context:
USER: switch the plt stuff to use an explicit fig
ASSISTANT: To switch the matplotlib plotting to use an explicit figure, we need to modify the `show_stats` function in the `benchmark/benchmark.py` file.

Here are the changes:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
    df_grouped.unstack().plot(kind="barh", figsize=(10, 6))

    plt.xlabel("Pass Rate 1")
    plt.ylabel("Model")
    plt.title("Pass Rate 1 for each Model/Edit Format")
    plt.legend(title="Edit Format")
    plt.show()
=======
    fig, ax = plt.subplots(figsize=(10, 6))
    df_grouped.unstack().plot(kind="barh", ax=ax)

    ax.set_xlabel("Pass Rate 1")
    ax.set_ylabel("Model")
    ax.set_title("Pass Rate 1 for each Model/Edit Format")
    ax.legend(title="Edit Format")
    plt.show()
>>>>>>> UPDATED
```

This change creates a new figure and axes before plotting, and then uses the axes for setting labels, title, and legend. This is a more explicit way of handling figures and axes in matplotlib.
2023-06-29 18:58:50 -07:00
Paul Gauthier
e075fe9f0d hello, graph 2023-06-29 18:58:21 -07:00
Paul Gauthier
c7ba0a5dba aider: Modified show_stats function to plot a bar graph with the average of "pass_rate_1" for each "edit_format" within each model, grouping the bars for each model together.
# Context:
USER: don't combine the edit formats. plot a different bar for each model/edit_format combo. But group the bars for each model together.
ASSISTANT: To implement this change, we need to modify the `show_stats` function in the `benchmark/benchmark.py` file. We will use the pandas and matplotlib libraries to create the bar plot.

Here are the steps:

1. In the `show_stats` function, after creating the DataFrame `df`, sort it based on "model" and "edit_format".
2. Group the DataFrame by "model" and "edit_format" and plot a bar graph for "pass_rate_1".

Here are the changes:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
def show_stats(dirnames):
    rows = []
    for dirname in dirnames:
        row = summarize_results(dirname)
        rows.append(vars(row))

    df = pd.DataFrame.from_records(rows)
    df.sort_values(by=["model", "edit_format"], inplace=True)

    df_grouped = df.groupby("model")["pass_rate_1"].mean()
    df_grouped.plot(kind='barh', figsize=(10, 6))

    plt.xlabel('Pass Rate 1')
    plt.ylabel('Model')
    plt.title('Pass Rate 1 for each Model/Edit Format')
    plt.show()

    df.to_csv("tmp.benchmark.csv")
=======
def show_stats(dirnames):
    rows = []
    for dirname in dirnames:
        row = summarize_results(dirname)
        rows.append(vars(row))

    df = pd.DataFrame.from_records(rows)
    df.sort_values(by=["model", "edit_format"], inplace=True)

    df_grouped = df.groupby(["model", "edit_format"])["pass_rate_1"].mean()
    df_grouped.unstack().plot(kind='barh', figsize=(10, 6))

    plt.xlabel('Pass Rate 1')
    plt.ylabel('Model')
    plt.title('Pass Rate 1 for each Model/Edit Format')
    plt.legend(title='Edit Format')
    plt.show()

    df.to_csv("tmp.benchmark.csv")
>>>>>>> UPDATED
```

This will create a bar graph with the average of "pass_rate_1" for each "edit_format" within each model. The bars for each model are grouped together.
2023-06-29 18:52:08 -07:00
Paul Gauthier
d5921871f7 aider: Sort DataFrame by "model" and "edit_format" and plot a horizontal bar graph of the average "pass_rate_1" for each model.
# Context:
USER: in `show_stats` plot a horizontal bar graph. sort on "model" and "edit_format". group the bars by model. plot the pass_rate_1 for each model/edit_format.
ASSISTANT: To implement this change, we need to modify the `show_stats` function in the `benchmark/benchmark.py` file. We will use the pandas and matplotlib libraries to create the bar plot.

Here are the steps:

1. Import the matplotlib library at the top of the file.
2. In the `show_stats` function, after creating the DataFrame `df`, sort it based on "model" and "edit_format".
3. Group the DataFrame by "model" and plot a bar graph for "pass_rate_1".

Here are the changes:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
import pandas as pd
=======
import pandas as pd
import matplotlib.pyplot as plt
>>>>>>> UPDATED
```

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
def show_stats(dirnames):
    rows = []
    for dirname in dirnames:
        row = summarize_results(dirname)
        rows.append(vars(row))

    df = pd.DataFrame.from_records(rows)

    print(df)

    df.to_csv("tmp.benchmark.csv")
=======
def show_stats(dirnames):
    rows = []
    for dirname in dirnames:
        row = summarize_results(dirname)
        rows.append(vars(row))

    df = pd.DataFrame.from_records(rows)
    df.sort_values(by=["model", "edit_format"], inplace=True)

    df_grouped = df.groupby("model")["pass_rate_1"].mean()
    df_grouped.plot(kind='barh', figsize=(10, 6))

    plt.xlabel('Pass Rate 1')
    plt.ylabel('Model')
    plt.title('Pass Rate 1 for each Model/Edit Format')
    plt.show()

    df.to_csv("tmp.benchmark.csv")
>>>>>>> UPDATED
```

Please note that this will create a bar graph with the average of "pass_rate_1" for each model. If you want to plot "pass_rate_1" for each "edit_format" within each model, we would need to modify the grouping and plotting code.
2023-06-29 18:50:59 -07:00
Paul Gauthier
fe0aed2420 pull summary data into a df 2023-06-29 18:48:20 -07:00
Paul Gauthier
6706b01eb8 aider: Import the List class from the typing module in benchmark.py.
# Context:
USER: benchmark/benchmark.py:60:15: F821 undefined name 'List'
ASSISTANT: To fix the undefined name 'List' error, we need to import the List class from the typing module.

Here is the change:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
import datetime
import json
import os
import random
import re
import shutil
import subprocess
import time
from collections import defaultdict
from json.decoder import JSONDecodeError
from pathlib import Path
=======
import datetime
import json
import os
import random
import re
import shutil
import subprocess
import time
from collections import defaultdict
from json.decoder import JSONDecodeError
from pathlib import Path
from typing import List
>>>>>>> UPDATED
```
2023-06-29 18:20:15 -07:00
Paul Gauthier
4bb174d040 refac to support multiple dirs 2023-06-29 18:19:57 -07:00
Paul Gauthier
3057b077bb aider: The dirname command line argument has been modified to accept multiple directory names as a list of strings.
# Context:
USER: make the `dirname` cmd line arg accept many dirnames
ASSISTANT: To make the `dirname` command line argument accept multiple directory names, we need to modify the `main` function in the `benchmark/benchmark.py` file.

We will change the `dirname` argument from a single string to a list of strings. This will allow the user to input multiple directory names separated by spaces.

Here is the change:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
def main(
    dirname: str = typer.Argument(..., help="Directory name"),
=======
def main(
    dirname: List[str] = typer.Argument(..., help="Directory names"),
>>>>>>> UPDATED
```

Please note that this change will require modifications in the rest of the code where `dirname` is used, as it will now be a list of strings instead of a single string. We will need to iterate over the list and apply the existing logic to each directory name.

If you agree with this change, I will proceed to identify and modify the rest of the code accordingly.
2023-06-29 18:14:30 -07:00
Paul Gauthier
417305a569 AIDER_BENCHMARK_DIR 2023-06-29 18:13:54 -07:00
Paul Gauthier
3bbfebc390 fixed console.print 2023-06-29 15:34:10 -07:00
Paul Gauthier
b11a6fa7ee track num_exhausted_context_windows 2023-06-29 15:32:45 -07:00
Paul Gauthier
92c2d47e20 added docker warning 2023-06-29 14:59:38 -07:00
Paul Gauthier
5fddb9c90a removed remaining 2023-06-29 14:32:00 -07:00
Paul Gauthier
1e40dca4f1 added --cont 2023-06-29 14:22:24 -07:00
Paul Gauthier
5ee455a05b docker environment for the whole benchmark 2023-06-29 14:19:00 -07:00
Paul Gauthier
5eb1b2bdec added timeout reporting 2023-06-29 11:55:28 -07:00
Paul Gauthier
b852c35828 added back timeout 2023-06-29 11:15:51 -07:00
Paul Gauthier
ec4dc74de4 subprocess to unittest 2023-06-29 11:01:57 -07:00
Paul Gauthier
1f4e08d20c removed timeouts 2023-06-29 10:42:09 -07:00
Paul Gauthier
13a87c7700 show test output 2023-06-29 10:33:53 -07:00
Paul Gauthier
24589b543e drive unittest directly 2023-06-29 10:26:33 -07:00
Paul Gauthier
81ec0e1608 refactor timeouts 2023-06-29 09:22:11 -07:00
Paul Gauthier
4f81f88c91 Added --no-aider, better error if docker is not available 2023-06-29 07:22:53 -07:00
Paul Gauthier
3c63552628 Revert "aider: Refactor run_unit_tests to kill the subprocess when it times out."
This reverts commit 6632367978.
2023-06-29 07:08:53 -07:00
Paul Gauthier
6632367978 aider: Refactor run_unit_tests to kill the subprocess when it times out.
# Context:
USER: when `subprocess` `timeout` fires, do we need to kill the subprocess? just in `run_unit_tests`
ASSISTANT: When a subprocess times out, it is not automatically terminated. So, yes, if you want to ensure that the subprocess is terminated when it times out, you should explicitly kill it.

Here's how you can modify the `run_unit_tests` function to kill the subprocess when it times out:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=timeout,
            )
            if result.returncode != 0:
                all_tests_passed = False
                print(f"Test {test_file} failed")

            res = cleanup_test_output(result.stdout)

        except subprocess.TimeoutExpired:
            all_tests_passed = False
            res = f"Test {test_file} timed out after {timeout} seconds."
=======
        with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True) as proc:
            try:
                stdout, _ = proc.communicate(timeout=timeout)
                if proc.returncode != 0:
                    all_tests_passed = False
                    print(f"Test {test_file} failed")
                res = cleanup_test_output(stdout)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
                all_tests_passed = False
                res = f"Test {test_file} timed out after {timeout} seconds."
>>>>>>> UPDATED
```

This change uses `subprocess.Popen` to create the subprocess and `communicate` to interact with it. If `communicate` times out, it raises `subprocess.TimeoutExpired`, and then we kill the subprocess with `proc.kill()`.
2023-06-28 18:29:58 -07:00
Paul Gauthier
d1f0590911 move benchmark prompts into their own file 2023-06-28 18:25:39 -07:00
Paul Gauthier
0845de7a7d ssh 2023-06-28 16:01:52 -07:00
Paul Gauthier
03a5598c60 Add simple sanity check for docker 2023-06-28 15:55:42 -07:00
Paul Gauthier
06656998e1 run tests with unittest, shrink output 2023-06-28 15:52:24 -07:00
Paul Gauthier
ea83df1ea3 remove long ==--- lines to save tokens 2023-06-28 15:44:23 -07:00
Paul Gauthier
a0844839ca wip 2023-06-28 15:36:34 -07:00
Paul Gauthier
67114d1a80 Fail if docker build fails 2023-06-28 11:02:47 -07:00
Paul Gauthier
67c27fd9ef Changed --stats-only -> --stats 2023-06-28 10:56:37 -07:00
Paul Gauthier
e24d03060d rename --retries to --tries 2023-06-27 17:45:57 -07:00
Paul Gauthier
11dddad4c1 zip the hashes 2023-06-27 17:42:52 -07:00
Paul Gauthier
10049d8695 remove timing info, to avoid randomizing the response to GPT 2023-06-27 17:15:05 -07:00
Paul Gauthier
9cfebc0b60 record response hashes too 2023-06-27 16:11:05 -07:00
Paul Gauthier
9556313819 Added functionality to use pre-existing benchmark results if they exist. 2023-06-27 14:27:07 -07:00
Paul Gauthier
70050ca004 improved logic for creating benchmark dirs 2023-06-27 08:02:37 -07:00
Paul Gauthier
29d60b9ddb check for dirname on --clean 2023-06-27 07:52:48 -07:00
Paul Gauthier
7736289309 test error prompt 2023-06-27 07:49:47 -07:00
Paul Gauthier
39434289f5 updated docker image name 2023-06-26 20:52:02 -07:00
Paul Gauthier
a109ab1890 aider: Added check for existing Docker image before building.
# Context:
USER: in `build_docker` check if the image is already built and return
ASSISTANT: To check if the Docker image is already built and return if it exists, we can use the `docker images` command and parse its output. Here's how to modify the `build_docker` function:

```python
benchmark/benchmark.py
<<<<<<< ORIGINAL
def build_docker():
    command = [
        "docker",
        "build",
        "--quiet",
        "-t",
        "benchmark",
        "-f",
        "benchmark/Dockerfile",
        "/dev/null",
    ]
    print(" ".join(command))

    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        res = result.stdout
        print(res)

    except subprocess.CalledProcessError as e:
        res = f"Failed to build Docker image: {e.output}"

    return res
=======
def build_docker():
    image_name = "benchmark"
    check_command = ["docker", "images", "-q", image_name]
    check_result = subprocess.run(check_command, stdout=subprocess.PIPE, text=True)
    if check_result.stdout.strip():
        print(f"Docker image '{image_name}' already exists, skipping build.")
        return

    command = [
        "docker",
        "build",
        "--quiet",
        "-t",
        image_name,
        "-f",
        "benchmark/Dockerfile",
        "/dev/null",
    ]
    print(" ".join(command))

    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        res = result.stdout
        print(res)

    except subprocess.CalledProcessError as e:
        res = f"Failed to build Docker image: {e.output}"

    return res
>>>>>>> UPDATED
```

This modification adds a check for the existence of the Docker image before attempting to build it. If the image already exists, the function prints a message and returns early.
2023-06-26 20:48:55 -07:00
Paul Gauthier
5e2681ec9d call build_docker 2023-06-26 20:47:49 -07:00
Paul Gauthier
898d98fdb3 aider: Added a build_docker() function that builds a docker image called benchmark from the benchmark/Dockerfile with no other fs context.
# Context:
USER: add a build_docker() func that builds a docker image called `benchmark` from the benchmark/Dockerfile with no other fs context
2023-06-26 20:41:38 -07:00
Paul Gauthier
095f47593d run unit tests inside docker for safety 2023-06-26 20:37:55 -07:00
Renamed from scripts/benchmark.py (Browse further)