![]() # Aider chat conversation: USER: I ran this command: ./benchmark/benchmark.py tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 --stats And got this output: ─────────────────────────────── tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 ─────────────────────────────── test-cases: 133 model: gpt-3.5-turbo-1106 edit_format: whole commit_hash: f658a65-dirty num_error_outputs: 2 num_user_asks: 2 num_exhausted_context_windows 0 test_timeouts: 2 41.4% correct after try 0 55.6% correct after try 1 duration: 3.1 sec/test-case costs: $0.0037/test-case, $0.49 total, $0.49 projected ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last): File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 772, in <module> app() File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 328, in __call__ raise e File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 716, in main return _main( ^^^^^^ File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 353, in main return show_stats(updated_dirnames) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 124, in show_stats if graphs: ^^^^^^ NameError: name 'graphs' is not defined ASSISTANT: It appears that the `graphs` variable is being used in the `show_stats` function without being defined or passed as an argument to the function. To fix this issue, we need to modify the `show_stats` function to accept `graphs` as a parameter and then pass the `graphs` variable when the function is called from the `main` function. Here are the steps to fix the issue: 1. Modify the `show_stats` function definition to accept a new parameter `graphs`. 2. Update the call to `show_stats` in the `main` function to pass the `graphs` variable. Let's make these changes in the `benchmark/benchmark.py` file. <source>python benchmark/benchmark.py <<<<<<< SEARCH def show_stats(dirnames): ======= def show_stats(dirnames, graphs): >>>>>>> REPLACE </source> <source>python benchmark/benchmark.py <<<<<<< SEARCH return show_stats(updated_dirnames) ======= return show_stats(updated_dirnames, graphs) >>>>>>> REPLACE </source> Please apply these changes to the `benchmark/benchmark.py` file to resolve the `NameError`. |
||
---|---|---|
.. | ||
__init__.py | ||
benchmark.py | ||
docker.sh | ||
docker_build.sh | ||
Dockerfile | ||
plot.sh | ||
prompts.py | ||
README.md | ||
rungrid.py | ||
test_benchmark.py |
Aider code editing benchmark harness
Aider uses a "code editing" benchmark to quantitatively measure how well it works with the GPT-3.5 and GPT-4 models. This directory holds the harness and tools needed to run the benchmarking suite.
Background
The benchmark is based on the Exercism python coding exercises. This benchmark evaluates how effectively aider and GPT can translate a natural language coding request into executable code saved into files that pass unit tests. It provides an end-to-end evaluation of not just GPT's coding ability, but also its capacity to edit existing code and format those code edits so that aider can save the edits to the local source files.
See this writeup for a longer discussion about the benchmark and how to interpret the results.
The benchmark is intended to be run inside a docker container.
This is because the benchmarking harness will be
taking code written by an LLM
and executing it without any human review or supervision!
The LLM could generate dangerous python that harms your system, like this: import os; os.system("sudo rm -rf /")
.
Running inside a docker container helps limit the damage that could be done.
Usage
There are 3 main tasks involved in benchmarking aider:
-
Install and setup for benchmarking.
-
Run the benchmark to measure performance across the 133 exercises.
-
Generate a summary report of how many of the exercises succeeded or failed.
Setup for benchmarking
First, prepare all the groundwork for running the benchmarks. These steps only need to be done once.
# Clone the aider repo
git clone git@github.com:paul-gauthier/aider.git
# Create the scratch dir to hold benchmarking results inside the main aider dir:
cd aider
mkdir tmp.benchmarks
# Clone the exercism repo
git clone git@github.com:exercism/python.git
# Copy the practice exercises into the benchmark scratch dir
cp -rp python/exercises/practice tmp.benchmarks/practice
# Build the docker container
./benchmark/docker_build.sh
Running the benchmark
Launch the docker container and run the benchmark inside it:
# Launch the docker container
./benchmark/docker.sh
# Inside the container, install aider as a development build.
# This way you're running the code that you cloned above, including any local changes.
pip install -e .
# Run the benchmark:
./benchmark/benchmark.py a-helpful-name-for-this-run --model gpt-3.5-turbo --edit-format whole --threads 10
The above will create a folder tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run
with benchmarking results.
Run like this, the script will run all 133 exercises in a random order.
You can run ./benchmark/benchmark.py --help
for a list of all the arguments, but here are the most useful to keep in mind:
--model
is the name of the model, same as you would pass directly toaider
.--edit-format
is the name of the edit format, same as you would pass directly toaider
. When working with an experimental LLM, I recommend starting withwhole
--threads
specifies how many exercises to benchmark in parallel. Start with a single thread if you are working out the kinks on your benchmarking setup or working with a new model, etc. Once you are getting reliable results, you can speed up the process by running with more threads. 10 works well against the OpenAI APIs.--num-tests
specifies how many of the 133 tests to run before stopping. This is another way to start gently as you debug your benchmarking setup.--keywords
filters the tests to run to only the ones whose name match the supplied argument (similar topytest -k xxxx
).
Generating a benchmark report
You can generate stats about any benchmark, including ones which are still running. You don't need to run this inside the docker container, as it is just collecting stats not executing unsafe python.
# Generate stats for a specific benchmarking directory
./benchmark/benchmark.py --stats tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run
Limitations, notes
- If you're experimenting with non-OpenAI models, the benchmarking harness may not provide enough switches/control to specify the integration to such models. You probably need to edit
benchmark.py
to instantiateCoder()
appropriately. You can just hack this in or add new switches/config. - Benchmarking all 133 exercises against GPT-4 will cost about $10-20.
- Benchmarking aider is intended for folks who are actively developing aider or doing experimental work adapting it for use with new LLM models.
- These scripts are not intended for use by typical aider users.
- Some of the tools are written as
bash
scripts, so it will be hard to use them on Windows.