Commit graph

272 commits

Paul Gauthier
85d9d8430f aider: Color the "gpt-3.5" models green in the plotting code.
# Aider chat conversation:

USER: color the gpt-3.5 models a new color

ASSISTANT: To color the "gpt-3.5" models a new color, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-3.5" in the model name and assign a different color.

Here are the changes:

benchmark/over_time.py
```python
<<<<<<< SEARCH
    colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
=======
    colors = ['red' if 'gpt-4' in model else 'green' if 'gpt-3.5' in model else 'blue' for model in models]
>>>>>>> REPLACE
```
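For reference, here is a minimal standalone sketch of how the chained conditional expression assigns colors (the model names are hypothetical, purely to illustrate the mapping):

```python
# hypothetical model names, just to show the three-way color mapping
models = ["gpt-4-0613", "gpt-3.5-turbo-0125", "claude-2"]
colors = ["red" if "gpt-4" in m else "green" if "gpt-3.5" in m else "blue" for m in models]
print(colors)  # ['red', 'green', 'blue']
```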
2024-05-15 11:04:14 -07:00
Paul Gauthier
0d11357980 aider: Color the "gpt-4" models differently.
# Aider chat conversation:

USER: color the "gpt-4" models differently

ASSISTANT: To color the "gpt-4" models differently, we need to modify the plotting code in `benchmark/over_time.py` to check for "gpt-4" in the model name and assign a different color.

Here are the changes:

benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.scatter(dates, pass_rates, c='blue', alpha=0.5)
=======
    colors = ['red' if 'gpt-4' in model else 'blue' for model in models]
    plt.scatter(dates, pass_rates, c=colors, alpha=0.5)
>>>>>>> REPLACE
```

This change will color the points red if the model name contains "gpt-4"; otherwise, it will color them blue.
2024-05-15 11:03:47 -07:00
Paul Gauthier
af2f4b73e9 aider: Use the released field directly in the plot instead of parsing it with datetime.strptime.
# Aider chat conversation:

USER: I ran this command:

 python benchmark/over_time.py

And got this output:

Traceback (most recent call last):
  File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 33, in <module>
    plot_over_time('_data/edit_leaderboard.yml')
  File "/Users/gauthier/Projects/aider/benchmark/over_time.py", line 15, in plot_over_time
    dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: strptime() argument 1 must be str, not datetime.date

ASSISTANT: The error indicates that the `released` field in the YAML file is already a `datetime.date` object, so we don't need to parse it with `datetime.strptime`. We can directly use it in the plot.

Here are the changes:

benchmark/over_time.py
```python
<<<<<<< SEARCH
            dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
=======
            dates.append(entry['released'])
>>>>>>> REPLACE
```
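For context, PyYAML resolves unquoted ISO-8601 dates to `datetime.date` objects while loading, which is why `strptime()` received a date rather than a string. A minimal sketch of that behavior:

```python
import yaml

# an unquoted ISO-8601 date is loaded as datetime.date, not str
entry = yaml.safe_load("released: 2024-04-09")
print(type(entry["released"]))  # <class 'datetime.date'>
```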
2024-05-15 11:03:07 -07:00
Paul Gauthier
49ac9ca357 aider: Added scatter plot functionality for YAML entries with a released date and pass_rate_2.
# Aider chat conversation:

USER: in over_time.py, draw a graph for each of the yaml entries with a `released` date. plot date on x, `pass_rate_2` on y and label with the `model` name. do a scatter plot. use colors/styles similar to the other graphs.

ASSISTANT: To draw a scatter plot for each of the YAML entries with a `released` date, we need to:

1. Read the YAML file.
2. Extract the `released` date and `pass_rate_2` for each entry.
3. Plot the data using matplotlib.

Here are the changes:

benchmark/over_time.py
```python
<<<<<<< SEARCH
=======
import matplotlib.pyplot as plt
import yaml
from datetime import datetime

def plot_over_time(yaml_file):
    with open(yaml_file, 'r') as file:
        data = yaml.safe_load(file)

    dates = []
    pass_rates = []
    models = []

    for entry in data:
        if 'released' in entry and 'pass_rate_2' in entry:
            dates.append(datetime.strptime(entry['released'], '%Y-%m-%d'))
            pass_rates.append(entry['pass_rate_2'])
            models.append(entry['model'])

    plt.figure(figsize=(10, 6))
    plt.scatter(dates, pass_rates, c='blue', alpha=0.5)

    for i, model in enumerate(models):
        plt.annotate(model, (dates[i], pass_rates[i]), fontsize=8, alpha=0.75)

    plt.xlabel('Release Date')
    plt.ylabel('Pass Rate 2')
    plt.title('Model Performance Over Time')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Example usage
plot_over_time('_data/edit_leaderboard.yml')
>>>>>>> REPLACE
```
2024-05-15 11:02:44 -07:00
Paul Gauthier
70b1c0c20c load .env in benchmark.py 2024-05-07 13:32:19 -07:00
Paul Gauthier
ecca737803 added deepseek-chat v2 2024-05-07 06:26:39 -07:00
Paul Gauthier
b1cae73a85 cleaned up csv output 2024-05-07 05:59:31 -07:00
Paul Gauthier
a7b08c7354 format output as yaml 2024-05-06 11:15:19 -07:00
Paul Gauthier
3162d42262 cleanup 2024-05-06 10:46:09 -07:00
Paul Gauthier
5fb7a323ec refactored plots 2024-05-06 10:44:34 -07:00
Paul Gauthier
3bb237bdc1 handle tasks with exceptions in the stats output 2024-05-05 08:24:45 -07:00
Paul Gauthier
9cdd9e12c3 catch all exceptions in the benchmark 2024-05-04 17:52:46 -07:00
Paul Gauthier
9b88f8caf6 updated gpt-4-0314 2024-05-04 07:59:27 -07:00
Paul Gauthier
f6580fff76 updated all openai models 2024-05-04 07:38:50 -07:00
Paul Gauthier
1981105932 aider: Implemented the TODO to extract the version from aider/__init__.py for each commit hash.
# Aider chat conversation:

USER: do the todo

ASSISTANT: Here is the *SEARCH/REPLACE block* to do the TODO in the code:

benchmark/benchmark.py
<source>python
<<<<<<< SEARCH
    for hsh in variants['commit_hash']):
        # TODO: get the output from `git show {hsh}:aider/__init__.py`
=======
    for hsh in variants['commit_hash']:
        try:
            version = subprocess.check_output(
                ['git', 'show', f'{hsh}:aider/__init__.py'],
                universal_newlines=True
            )
            version = re.search(r'__version__ = "(.*)"', version).group(1)
            csv.append(version)
        except subprocess.CalledProcessError:
            csv.append('unknown')
>>>>>>> REPLACE
</source>

This change:
1. Loops through each commit hash in `variants['commit_hash']`
2. Uses `subprocess.check_output` to run the `git show` command and capture the contents of `aider/__init__.py` at that commit
3. Searches the file contents for the `__version__ = "X.Y.Z"` line and extracts the version string
4. Appends the version to the `csv` list, or `'unknown'` if there was an error getting the file contents

The `try/except` handles cases where the commit hash doesn't exist or `aider/__init__.py` is missing.
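As a standalone illustration, the same `git show` plus regex pattern looks like this (the commit hash here is hypothetical):

```python
import re
import subprocess

hsh = "abc1234"  # hypothetical commit hash
try:
    # read aider/__init__.py as it existed at that commit
    text = subprocess.check_output(
        ["git", "show", f"{hsh}:aider/__init__.py"],
        universal_newlines=True,
    )
    match = re.search(r'__version__ = "(.*)"', text)
    version = match.group(1) if match else "unknown"
except subprocess.CalledProcessError:
    version = "unknown"
print(version)
```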
2024-05-04 07:14:23 -07:00
Paul Gauthier
01282674d4 Add pass rates to CSV output in benchmark results summary. 2024-05-04 07:13:40 -07:00
Paul Gauthier
f0b794ce9c revert docker.sh 2024-05-03 14:46:52 -07:00
Paul Gauthier
26d841ccc7 aider: Add --add-host flag to docker run command to allow container to connect to host machine on macOS using host.docker.internal.
# Aider chat conversation:

USER: when i run `curl http://127.0.0.1:11434` inside the container, it "Couldn't connect to server". i want it to connect to the server running on the host computer on that port. the host is a mac.

ASSISTANT: To allow the Docker container to connect to the host machine on macOS, you can use the special DNS name `host.docker.internal` which resolves to the internal IP address used by the host. This is a feature provided by Docker for Mac to facilitate such connections.

Here's the diff to update the `docker.sh` script to include the `--add-host` flag, which adds an entry to the container's `/etc/hosts` file:

```diff
--- benchmark/docker.sh
+++ benchmark/docker.sh
@@ ... @@
 docker run \
        -it --rm \
        -v `pwd`:/aider \
        -v `pwd`/tmp.benchmarks/.:/benchmarks \
        -e OPENAI_API_KEY=$OPENAI_API_KEY \
        -e HISTFILE=/aider/.bash_history \
        -e AIDER_DOCKER=1 \
        -e AIDER_BENCHMARK_DIR=/benchmarks \
+       --add-host=host.docker.internal:host-gateway \
        --net=host \
        -p 11434:11434 \
        aider-benchmark \
        bash
```

With this change, when you run `curl http://host.docker.internal:11434` inside the container, it should connect to the server running on the host computer on that port.
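A quick way to verify the connection from inside the container is a standard-library probe like this sketch (it assumes Ollama's default port, 11434):

```python
import urllib.request

# inside the container, the host's server is reachable via the special
# DNS name; 127.0.0.1 would point at the container itself
with urllib.request.urlopen("http://host.docker.internal:11434") as resp:
    print(resp.status)  # 200 if the Ollama server is up on the host
```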
2024-05-01 13:54:30 -07:00
Paul Gauthier
71b675d0ef Add network host mode and change port mapping in docker run script. 2024-05-01 13:54:28 -07:00
Paul Gauthier
5e4ac44008 aider: Add port forwarding for the client inside the container to connect to the Ollama server on the host.
# Aider chat conversation:

USER: add a port fwd so a client inside the container can connect to the ollama server running on the host

ASSISTANT: To forward a port so that a client inside the container can connect to the Ollama server running on the host, you need to add a `--publish` or `-p` option to the `docker run` command in the `benchmark/docker.sh` script. This option takes the format `host_port:container_port`.

Assuming the Ollama server is running on port 12345 on the host, and you want to use the same port inside the container, the diff would look like this:

```diff
--- benchmark/docker.sh
+++ benchmark/docker.sh
@@ ... @@
 docker run \
        -it --rm \
        -v `pwd`:/aider \
        -v `pwd`/tmp.benchmarks/.:/benchmarks \
        -e OPENAI_API_KEY=$OPENAI_API_KEY \
        -e HISTFILE=/aider/.bash_history \
        -e AIDER_DOCKER=1 \
        -e AIDER_BENCHMARK_DIR=/benchmarks \
+       -p 12345:12345 \
        aider-benchmark \
        bash
```

This change will forward port 12345 from the host to port 12345 in the container, allowing a client inside the container to connect to the Ollama server running on the host machine.
2024-05-01 13:46:45 -07:00
Paul Gauthier
4461c7c4b2 fixed benchmark 2024-04-23 09:44:04 -07:00
Paul Gauthier
fd5b9bbfcb Added groq llama3 2024-04-22 07:12:01 -07:00
Paul Gauthier
434fa5f6a7 updated benchmark to new Coder & Model classes 2024-04-19 15:21:24 -07:00
Paul Gauthier
7875418183 fix column order 2024-04-09 18:11:08 -07:00
Paul Gauthier
00f1cdb561 Added gpt-4-turbo vision blog post 2024-04-09 16:55:35 -07:00
Paul Gauthier
ac39791fee fixed mislabelled gpt-4 column 2024-03-09 08:20:27 -08:00
Paul Gauthier
f5887a5098 tweaking graph labels 2024-02-03 08:25:19 -08:00
Paul Gauthier
9033be74bf Initial benchmark results for 0125 2024-01-25 13:00:16 -08:00
Joshua Vial
93f32d3855 make benchmark listen to openai_api_base env var 2023-12-21 09:38:54 +13:00
Joshua Vial
9e656945fe Merge remote-tracking branch 'upstream/main' into gpt4-vision 2023-12-21 09:29:32 +13:00
Paul Gauthier
76c1deae6a improved test for toplevel refactored func 2023-12-19 15:58:24 -08:00
Joshua Vial
d4e663f7bc benchmark work with openrouter 2023-12-20 10:27:33 +13:00
Paul Gauthier
755b3858eb copy 2023-12-19 11:11:58 -08:00
Paul Gauthier
ef2a1f3875 diff with move hint 2023-12-18 19:09:32 -08:00
Paul Gauthier
308007a8e9 laziness24-turbo-udiff-never2x 2023-12-18 18:43:15 -08:00
Paul Gauthier
e3c8fac604 copy 2023-12-18 10:20:40 -08:00
Paul Gauthier
b0c03820e9 copy 2023-12-18 10:19:38 -08:00
Paul Gauthier
16534e914b better graph 2023-12-18 10:02:52 -08:00
Paul Gauthier
6ab2db192c Added udiff graph 2023-12-18 09:53:28 -08:00
Paul Gauthier
0d4b6b6885 added missing files 2023-12-17 13:00:32 -08:00
Paul Gauthier
7113a30271 unified diffs 2023-12-17 12:54:34 -08:00
Paul Gauthier
cab7460f94 catch 404s from azure on models.list 2023-12-07 07:44:21 -08:00
Paul Gauthier
bf03f43b44 fixed benchmark 2023-12-05 11:21:11 -08:00
Paul Gauthier
6ebc142377 roughed in openai 1.x 2023-12-05 07:37:05 -08:00
Ian Brown
049eb9966d Use expected exercism-python destination
Signed-off-by: Ian Brown <ian@zestysoft.com>
2023-11-25 12:11:12 -08:00
Paul Gauthier
1d0bc3dcb6 Updated benchmark reports 2023-11-14 16:03:50 -08:00
Paul Gauthier
8f8a32f325 aider: Fix NameError in show_stats function by passing graphs as a parameter.
# Aider chat conversation:

USER: I ran this command:

 ./benchmark/benchmark.py tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 --stats

And got this output:

─────────────────────────────── tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 ───────────────────────────────
test-cases: 133
model: gpt-3.5-turbo-1106
edit_format: whole
commit_hash: f658a65-dirty
num_error_outputs: 2
num_user_asks: 2
num_exhausted_context_windows: 0
test_timeouts: 2

41.4% correct after try 0
55.6% correct after try 1

duration: 3.1 sec/test-case
costs: $0.0037/test-case, $0.49 total, $0.49 projected
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 772, in <module>
    app()
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 328, in __call__
    raise e
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 716, in main
    return _main(
           ^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 353, in main
    return show_stats(updated_dirnames)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 124, in show_stats
    if graphs:
       ^^^^^^
NameError: name 'graphs' is not defined

ASSISTANT: It appears that the `graphs` variable is being used in the `show_stats` function without being defined or passed as an argument to the function. To fix this issue, we need to modify the `show_stats` function to accept `graphs` as a parameter and then pass the `graphs` variable when the function is called from the `main` function.

Here are the steps to fix the issue:

1. Modify the `show_stats` function definition to accept a new parameter `graphs`.
2. Update the call to `show_stats` in the `main` function to pass the `graphs` variable.

Let's make these changes in the `benchmark/benchmark.py` file.

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
def show_stats(dirnames):
=======
def show_stats(dirnames, graphs):
>>>>>>> REPLACE
</source>

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
return show_stats(updated_dirnames)
=======
return show_stats(updated_dirnames, graphs)
>>>>>>> REPLACE
</source>

Please apply these changes to the `benchmark/benchmark.py` file to resolve the `NameError`.
2023-11-08 14:15:08 -08:00
Paul Gauthier
9351f438ac aider: Add a cmd line switch --graphs and only draw them if it's provided.
# Aider chat conversation:

USER: add a cmd line switch --graphs and only draw them if it's provided

ASSISTANT: To add a command-line switch `--graphs` and conditionally generate graphs based on its presence, we need to modify the `benchmark.py` file. Specifically, we will add a new option to the `main` function to accept the `--graphs` switch and then use this option to control the calls to `plot_timing` and `plot_outcomes` functions.

Here are the steps we will follow:

1. Add the `--graphs` option to the `typer` command-line interface in the `main` function.
2. Use the value of this option to conditionally call the `plot_timing` and `plot_outcomes` functions.

Let's make these changes:

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
=======
@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
    graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"),
>>>>>>> REPLACE

...

<<<<<<< SEARCH
    plot_timing(df)
    plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg)
=======
    if graphs:
        plot_timing(df)
        plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg)
>>>>>>> REPLACE
</source>

With these changes, the `--graphs` switch will be available when running the `benchmark.py` script, and the graph generation will only occur if this switch is provided.
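For context, the same pattern in a minimal self-contained Typer app (a sketch, not aider's actual CLI):

```python
from typing import List

import typer

app = typer.Typer()

@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
    graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"),
):
    # graphs stays False unless --graphs appears on the command line
    if graphs:
        typer.echo(f"would generate graphs for {dirnames}")

if __name__ == "__main__":
    app()
```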
2023-11-08 14:13:59 -08:00
Paul Gauthier
ed0dc95a10 copy 2023-11-08 11:16:56 -08:00
Paul Gauthier
f2a3e489e1 labels 2023-11-08 10:26:22 -08:00