GPT's coding ability, but also its capacity to *edit existing code*
and *format those code edits* so that aider can save the
edits to the local source files.
See [this writeup for a longer discussion about the benchmark](https://aider.chat/2024/12/21/polyglot.html).
The benchmark is intended to be run *inside a docker container*.
This is because the benchmarking harness will be
There are 3 main tasks involved in benchmarking aider:

1. Install and setup for benchmarking.
2. Run the benchmark to measure performance across all the exercises.
3. Generate a summary report of how many of the exercises succeeded or failed.
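The three tasks above can be sketched as a toy harness. This is a hypothetical illustration, not the repo's actual API: `run_exercise`, `run_benchmark`, and the exercise names are invented here; only the ideas (random order, parallel workers, a pass/fail tally) come from this document.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_exercise(name: str) -> bool:
    # Stand-in for "ask the model to solve one exercise, then run its tests"
    # (hypothetical; the real harness invokes aider and the exercise's tests).
    return not name.startswith("fail")

def run_benchmark(exercises, threads=1, seed=0):
    order = list(exercises)
    random.Random(seed).shuffle(order)  # exercises run in a random order
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(run_exercise, order))
    passed = sum(results)
    return passed, len(results) - passed  # summary: (succeeded, failed)

print(run_benchmark(["two-sum", "fail-case", "anagram"], threads=2))  # (2, 1)
```

The pass/fail counts are order-independent, which is why running the exercises shuffled and in parallel still yields a stable summary.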
```
pip install -e .[dev]
```
The above will create a folder `tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run` with benchmarking results.
Run like this, the script will run all the exercises in a random order.
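The results folder name follows the timestamped layout shown above. A minimal sketch of how such a name can be built, assuming `run_dirname` as a hypothetical helper (the script computes this internally):

```python
from datetime import datetime

def run_dirname(label: str, now: datetime) -> str:
    # Builds the tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--name layout described above
    # (hypothetical helper, not the script's actual code).
    return f"tmp.benchmarks/{now.strftime('%Y-%m-%d-%H-%M-%S')}--{label}"

print(run_dirname("a-helpful-name-for-this-run", datetime(2024, 12, 21, 10, 30, 0)))
# tmp.benchmarks/2024-12-21-10-30-00--a-helpful-name-for-this-run
```

Timestamp-prefixed names sort chronologically, so successive runs line up naturally in `tmp.benchmarks/`.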

You can run `./benchmark/benchmark.py --help` for a list of all the arguments, but here are the most useful to keep in mind:
- `--model` is the name of the model, same as you would pass directly to `aider`.
- `--edit-format` is the name of the edit format, same as you would pass directly to `aider`. When working with an experimental LLM, I recommend starting with `whole`.
- `--threads` specifies how many exercises to benchmark in parallel. Start with a single thread if you are working out the kinks in your benchmarking setup or working with a new model. Once you are getting reliable results, you can speed up the process by running with more threads; 10 works well against the OpenAI APIs.
- `--num-tests` specifies how many of the tests to run before stopping. This is another way to start gently as you debug your benchmarking setup.
- `--keywords` filters the tests down to only the ones whose names match the supplied argument (similar to `pytest -k xxxx`).
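As a rough sketch of the kind of name matching `--keywords` describes, here is a hypothetical re-implementation (`filter_by_keywords` is invented for illustration; it is not the script's actual code):

```python
def filter_by_keywords(test_names, keywords):
    # Keep tests whose names contain any comma-separated keyword,
    # similar in spirit to pytest -k (hypothetical helper).
    kws = [k.strip() for k in keywords.split(",") if k.strip()]
    return [t for t in test_names if any(k in t for k in kws)]

print(filter_by_keywords(["two-sum", "binary-search", "hello-world"], "sum,hello"))
# ['two-sum', 'hello-world']
```

Substring matching like this makes it easy to re-run just the exercises you are debugging.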

### Benchmark report