Paul Gauthier 2024-08-19 20:47:03 -07:00
parent 2944445340
commit 86a7a17d47

@@ -101,27 +101,29 @@ collecting stats not executing unsafe python.
 The benchmark report is a yaml record with statistics about the run:

 ```yaml
-- dirname: 2024-08-15-13-26-38--json-no-lint-deepseek-coder-whole
+- dirname: 2024-07-04-14-32-08--claude-3.5-sonnet-diff-continue
   test_cases: 133
-  model: deepseek-coder V2 0724
-  edit_format: Markdown
-  commit_hash: bac04a2
-  pass_rate_1: 59.4
-  percent_cases_well_formed: 100.0
-  error_outputs: 2
-  num_malformed_responses: 0
-  num_with_malformed_responses: 0
+  model: claude-3.5-sonnet
+  edit_format: diff
+  commit_hash: 35f21b5
+  pass_rate_1: 57.1
+  pass_rate_2: 77.4
+  percent_cases_well_formed: 99.2
+  error_outputs: 23
+  released: 2024-06-20
+  num_malformed_responses: 4
+  num_with_malformed_responses: 1
   user_asks: 2
   lazy_comments: 0
-  syntax_errors: 0
+  syntax_errors: 1
   indentation_errors: 0
   exhausted_context_windows: 0
-  test_timeouts: 0
-  command: aider --model deepseek-coder
-  date: 2024-08-15
-  versions: 0.50.2-dev
-  seconds_per_case: 27.9
-  total_cost: 0.0438
+  test_timeouts: 1
+  command: aider --sonnet
+  date: 2024-07-04
+  versions: 0.42.1-dev
+  seconds_per_case: 17.6
+  total_cost: 3.6346
 ```

 The key statistics are the `pass_rate_#` entries, which report the
@@ -129,8 +131,9 @@ percent of the tasks which had all tests passing.
 There will be multiple of these pass rate stats,
 depending on the value of the `--tries` parameter.

-The yaml also includes all the settings which were in effect for the benchmark and
-the git hash of the repo. The `model`, `edit_format` and `commit_hash`
+The yaml also includes all the settings which were in effect for the benchmark run and
+the git hash of the repo used to run it.
+The `model`, `edit_format` and `commit_hash`
 should be enough to reliably reproduce any benchmark run.

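Reviewer note: a minimal sketch of how the `pass_rate_#` entries in a report record might be collected. The `pass_rates` helper is hypothetical and not part of aider, and it assumes the numeric suffix of each key identifies the try. The sample record is a plain dict holding a few fields copied from the claude-3.5-sonnet example in this diff.

```python
import re


def pass_rates(record: dict) -> dict[int, float]:
    """Hypothetical helper: map each pass_rate_N key to its value, ordered by N."""
    rates = {}
    for key, value in record.items():
        match = re.fullmatch(r"pass_rate_(\d+)", key)
        if match:
            rates[int(match.group(1))] = float(value)
    return dict(sorted(rates.items()))


# A subset of the fields from the report record shown in the diff.
record = {
    "model": "claude-3.5-sonnet",
    "edit_format": "diff",
    "commit_hash": "35f21b5",
    "pass_rate_1": 57.1,
    "pass_rate_2": 77.4,
}

for attempt, rate in pass_rates(record).items():
    print(f"pass_rate_{attempt}: {rate}")
```

A record from a run with a different `--tries` value would simply yield more or fewer entries, so the helper does not hard-code how many to expect.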
 You can see examples of the benchmark report yaml in the