From 3a93da8f8dea15a30ec335c1f46a99ff85b9b727 Mon Sep 17 00:00:00 2001
From: AJ <yspdev@gmail.com>
Date: Fri, 25 Apr 2025 17:48:08 -0700
Subject: [PATCH] Add architect mode information to benchmark README

---
 benchmark/README.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/benchmark/README.md b/benchmark/README.md
index 5d06d7576..e269c8de1 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -84,6 +84,9 @@ You can run `./benchmark/benchmark.py --help` for a list of all the arguments, b
 - `--keywords` filters the tests to run to only the ones whose name match the supplied argument (similar to `pytest -k xxxx`).
 - `--read-model-settings=<filename.yml>` specify model settings, see here: https://aider.chat/docs/config/adv-model-settings.html#model-settings
 - `--resume` resume a previously paused benchmark run from its checkpoint
+- `--edit-format architect` run in architect mode, which uses two models: one to propose changes and another to implement them
+- `--editor-model` specify the model to use for implementing changes in architect mode
+- `--reasoning-effort` set reasoning effort for models that support it (e.g., `high`, `medium`, `low`)
 
 ### Pausing and Resuming Benchmarks
 
@@ -149,6 +152,24 @@ should be enough to reliably reproduce any benchmark run.
 You can see examples of the benchmark report yaml in the
 [aider leaderboard data files](https://github.com/Aider-AI/aider/blob/main/aider/website/_data/).
 
+### Running benchmarks in architect mode
+
+Architect mode uses two models: a main model that proposes changes and an editor model that turns those proposals into file edits. This can be particularly useful for models that reason well but struggle to produce precise code edits.
+
+Here's an example of running a benchmark in architect mode:
+
+```
+./benchmark/benchmark.py grok-mini-architect-deepseek-editor --model openrouter/x-ai/grok-3-mini-beta --editor-model openrouter/deepseek/deepseek-chat-v3-0324 --edit-format architect --threads 15 --exercises-dir polyglot-benchmark --reasoning-effort high
+```
+
+In this example:
+- The main model is Grok-3-mini-beta (via OpenRouter)
+- The editor model is DeepSeek Chat v3 (via OpenRouter)
+- The edit format is set to `architect`
+- Reasoning effort is set to "high"
+- 15 threads are used for parallel processing
+
+When running in architect mode, the benchmark report will include additional information about the editor model used.
 
 ## Limitations, notes