Add architect mode information to benchmark README

AJ 2025-04-25 17:48:08 -07:00
parent cbd744df0e
commit 3a93da8f8d

@@ -84,6 +84,9 @@ You can run `./benchmark/benchmark.py --help` for a list of all the arguments, b
- `--keywords` filters the tests to run to only the ones whose names match the supplied argument (similar to `pytest -k xxxx`).
- `--read-model-settings=<filename.yml>` specify model settings, see here: https://aider.chat/docs/config/adv-model-settings.html#model-settings
- `--resume` resume a previously paused benchmark run from its checkpoint
- `--edit-format architect` run in architect mode, which uses two models: one to propose changes and another to implement them
- `--editor-model` specify the model to use for implementing changes in architect mode
- `--reasoning-effort` set reasoning effort for models that support it (e.g., "high", "medium", "low")
### Pausing and Resuming Benchmarks
@@ -149,6 +152,24 @@ should be enough to reliably reproduce any benchmark run.
You can see examples of the benchmark report yaml in the
[aider leaderboard data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).
### Running benchmarks in architect mode
Architect mode uses two models: a main model that proposes changes and an editor model that implements them. This can be particularly useful for models that are good at reasoning but struggle with precise code edits.
Here's an example of running a benchmark in architect mode:
```
./benchmark/benchmark.py grook-mini-architect-deepseek-editor --model openrouter/x-ai/grok-3-mini-beta --editor-model openrouter/deepseek/deepseek-chat-v3-0324 --edit-format architect --threads 15 --exercises-dir polyglot-benchmark --reasoning-effort high
```
In this example:
- The main model is Grok-3-mini-beta (via OpenRouter)
- The editor model is DeepSeek Chat v3-0324 (via OpenRouter)
- The edit format is set to "architect"
- Reasoning effort is set to "high"
- 15 threads are used for parallel processing
When running in architect mode, the benchmark report will include additional information about the editor model used.
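If you launch architect-mode runs from a script, the flag breakdown above can be sketched as a small Python helper that assembles the same command line. This is only an illustration: `architect_benchmark_cmd` is a hypothetical name, not part of aider, and it uses only the flags shown in this README.

```python
import shlex


def architect_benchmark_cmd(run_name, model, editor_model, threads=15,
                            exercises_dir="polyglot-benchmark",
                            reasoning_effort=None):
    """Build the argv list for an architect-mode benchmark run.

    Hypothetical helper for illustration; it only strings together
    the flags documented in this README.
    """
    cmd = [
        "./benchmark/benchmark.py", run_name,
        "--model", model,                # main model: proposes changes
        "--editor-model", editor_model,  # editor model: implements them
        "--edit-format", "architect",
        "--threads", str(threads),
        "--exercises-dir", exercises_dir,
    ]
    # Only pass --reasoning-effort when a value was requested.
    if reasoning_effort is not None:
        cmd += ["--reasoning-effort", reasoning_effort]
    return cmd


cmd = architect_benchmark_cmd(
    "grook-mini-architect-deepseek-editor",
    model="openrouter/x-ai/grok-3-mini-beta",
    editor_model="openrouter/deepseek/deepseek-chat-v3-0324",
    reasoning_effort="high",
)
# Print a shell-safe version of the command for copy/paste or logging.
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd)` from the repository root, which avoids shell quoting issues with model names.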
## Limitations, notes