From 2944445340fe6acab02fe8222c450696e861b1c4 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Mon, 19 Aug 2024 20:44:48 -0700
Subject: [PATCH] copy

---
 benchmark/README.md | 49 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/benchmark/README.md b/benchmark/README.md
index 6292b18c0..b62052d40 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -98,10 +98,49 @@ collecting stats not executing unsafe python.
 ./benchmark/benchmark.py --stats tmp.benchmarks/YYYY-MM-DD-HH-MM-SS--a-helpful-name-for-this-run
 ```
 
+The benchmark report is a yaml record with statistics about the run:
+
+```yaml
+- dirname: 2024-08-15-13-26-38--json-no-lint-deepseek-coder-whole
+  test_cases: 133
+  model: deepseek-coder V2 0724
+  edit_format: Markdown
+  commit_hash: bac04a2
+  pass_rate_1: 59.4
+  percent_cases_well_formed: 100.0
+  error_outputs: 2
+  num_malformed_responses: 0
+  num_with_malformed_responses: 0
+  user_asks: 2
+  lazy_comments: 0
+  syntax_errors: 0
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 0
+  command: aider --model deepseek-coder
+  date: 2024-08-15
+  versions: 0.50.2-dev
+  seconds_per_case: 27.9
+  total_cost: 0.0438
+```
+
+The key statistics are the `pass_rate_#` entries, which report the
+percent of the tasks which had all tests passing.
+There will be multiple of these pass rate stats,
+depending on the value of the `--tries` parameter.
+
+The yaml also includes all the settings which were in effect for the benchmark and
+the git hash of the repo. The `model`, `edit_format` and `commit_hash`
+should be enough to reliably reproduce any benchmark run.
+
+You can see examples of the benchmark report yaml in the
+[aider leaderboard data files](https://github.com/paul-gauthier/aider/blob/main/website/_data/).
+
+
 ## Limitations, notes
 
-- If you're experimenting with non-OpenAI models, the benchmarking harness may not provide enough switches/control to specify the integration to such models. You probably need to edit `benchmark.py` to instantiate `Coder()` appropriately. You can just hack this in or add new switches/config.
-- Benchmarking all 133 exercises against GPT-4 will cost about $10-20.
-- Benchmarking aider is intended for folks who are actively developing aider or doing experimental work adapting it for use with [new LLM models](https://github.com/paul-gauthier/aider/issues/172).
-- These scripts are not intended for use by typical aider users.
-- Some of the tools are written as `bash` scripts, so it will be hard to use them on Windows.
+- Benchmarking all 133 exercises against Claude 3.5 Sonnet will cost about $4.
+- Contributions of benchmark results are welcome! Submit results by opening a PR with edits to the
+[benchmark results data files](https://github.com/paul-gauthier/aider/blob/main/website/_data/).
+- These scripts are not intended for use by typical aider end users.
+- Some of these tools are written as `bash` scripts, so it will be hard to use them on Windows.
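
Since the report added by this patch is plain YAML (a list of per-run records), it can be read back with any YAML parser. Below is a minimal sketch, not part of the patch itself, assuming PyYAML is installed and a copy of one report record has been saved to a hypothetical file named `report.yml`:

```python
# Illustrative sketch only: `report.yml` is a hypothetical local copy of a
# benchmark report record like the one shown in the diff above.
import yaml

with open("report.yml") as f:
    runs = yaml.safe_load(f)  # the report parses as a list of run records

for run in runs:
    # Print the model, edit format, and first-try pass rate for each run.
    print(run["model"], run["edit_format"], run["pass_rate_1"])
```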