258 commits

Author SHA1 Message Date
Paul Gauthier
7f30320566 chore: Disable pretty printing in benchmark I/O 2025-05-09 10:07:21 -07:00
Paul Gauthier (aider)
5090f28151 feat: Track total tokens and use in benchmark stats 2025-05-07 21:08:29 -07:00
Paul Gauthier (aider)
a98b531bcc feat: add prompt_tokens and completion_tokens to results summary 2025-05-07 21:02:00 -07:00
Paul Gauthier (aider)
1a4d3927e7 feat: Add --thinking-tokens option to benchmark script 2025-04-20 11:29:33 -07:00
Paul Gauthier
622bf349c5 chore: Add num_ctx and sleep to run_test_threaded.gather arguments 2025-04-17 20:08:57 -07:00
Paul Gauthier (aider)
05eaf82b36 feat: Pass verbose flag to Model class for detailed output 2025-04-17 20:02:31 -07:00
Paul Gauthier (aider)
5c8150fd16 fix: Change reasoning_effort type to string in benchmark script 2025-04-17 20:02:09 -07:00
Paul Gauthier (aider)
ec9327dcb4 style: Apply linter to benchmark.py 2025-04-17 20:01:30 -07:00
Paul Gauthier (aider)
8e689d35af Feat: Add --reasoning-effort switch to benchmark script 2025-04-17 20:01:26 -07:00
Paul Gauthier
60d11a6eba use LONG_TIMEOUT 2025-02-24 13:51:21 -08:00
Paul Gauthier
6118d91922 improve unit tests in benchmark 2025-02-06 16:27:29 -08:00
Paul Gauthier (aider)
0336a982ff feat: Add model settings loading and registration to benchmark script 2025-01-28 09:39:39 -08:00
Paul Gauthier (aider)
aa18b63c16 refactor: Simplify model settings loading in benchmark script 2025-01-28 09:38:57 -08:00
Paul Gauthier (aider)
3f890551e7 fix: Add missing read_model_settings parameter to run_test_real function 2025-01-28 09:33:14 -08:00
Paul Gauthier (aider)
823127c87e style: Apply linter formatting to benchmark.py 2025-01-28 09:32:55 -08:00
Paul Gauthier (aider)
cf2c9c6dc7 feat: Add --read-model-settings option to benchmark for loading model settings 2025-01-28 09:32:46 -08:00
Paul Gauthier
9b63b90ec4 refactor: Remove unnecessary blank line in benchmark.py 2025-01-28 09:32:35 -08:00
Paul Gauthier
dff544cd5d refactor: Split summarize method and add model metadata handling 2025-01-20 09:38:45 -08:00
Paul Gauthier
a08326ab60 enable all java tests 2025-01-15 15:18:46 -08:00
Paul Gauthier
63cf99361d ensure no loading of any other files 2025-01-15 13:57:54 -08:00
Nimesh Ghelani
ed9d70903d Fix files not being excluded in benchmark.py
`.discard()` removes an item from the set. `.difference_update()` is the
correct call here.
2025-01-07 17:35:29 +00:00
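The distinction this commit message draws can be shown with a minimal sketch. The function names and file paths below are hypothetical and are not taken from benchmark.py; the sketch only assumes the ignored files were held in a set. `set.discard()` treats its argument as a single element, so passing a whole set of paths removes nothing and raises no error, whereas `set.difference_update()` removes every member of the given collection.

```python
def exclude_with_discard(files, ignored):
    """Buggy variant: discard() looks for `ignored` itself as one element."""
    remaining = set(files)
    # A set argument is searched for as an equivalent frozenset element.
    # No such element exists, so this is a silent no-op.
    remaining.discard(set(ignored))
    return remaining


def exclude_with_difference_update(files, ignored):
    """Fixed variant: difference_update() removes each member of `ignored`."""
    remaining = set(files)
    remaining.difference_update(ignored)
    return remaining


# Hypothetical paths, for illustration only.
files = {"solution.py", ".meta/config.json", ".docs/intro.md"}
ignored = {".meta/config.json", ".docs/intro.md"}

print(sorted(exclude_with_discard(files, ignored)))
print(sorted(exclude_with_difference_update(files, ignored)))
```

Because the buggy call fails silently, the symptom is simply that the excluded files stay in the set, which matches the commit title.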
Paul Gauthier (aider)
c5919f0c15 refactor: improve cleanup error handling and verbose logging 2025-01-04 10:55:11 -08:00
Paul Gauthier
ac160cac12 chore: Ignore exceptions during Rust target directory cleanup 2025-01-04 10:55:09 -08:00
Paul Gauthier (aider)
729354b038 chore: Add cleanup for node_modules directories in benchmark tests 2025-01-03 14:19:06 -05:00
Paul Gauthier (aider)
c0be857f37 chore: Add Java build directory cleanup to test runner 2025-01-03 14:16:51 -05:00
Paul Gauthier
98b0e88ace refactor: simplify Rust target directory cleanup logic 2025-01-03 14:16:49 -05:00
Paul Gauthier (aider)
3d501df21f chore: Clean up Rust target/debug directory after all test attempts 2025-01-03 14:14:44 -05:00
Paul Gauthier
1b4abb747d style: Add blank line for readability in benchmark.py 2025-01-03 14:14:42 -05:00
Paul Gauthier (aider)
f035c4c01a fix: Remove max_apply_update_errors from threaded call 2024-12-27 16:36:58 -04:00
Paul Gauthier (aider)
8fcdcecf36 refactor: Remove deprecated max_apply_update_errors 2024-12-27 16:36:47 -04:00
Paul Gauthier
3f9ee1ac2e refactor: Remove deprecated max_apply_update_errors 2024-12-27 16:36:46 -04:00
Paul Gauthier
188e1f788d chore: Rename exercism dir to polyglot-benchmark 2024-12-27 16:33:04 -04:00
Paul Gauthier (aider)
a75507980a fix: Pass stats_languages to summarize_results and show_stats 2024-12-20 16:04:00 -08:00
Paul Gauthier (aider)
8d0decc17a style: Apply linter formatting 2024-12-20 16:03:44 -08:00
Paul Gauthier (aider)
e334cbb5d4 fix: Correct indentation in load_results function 2024-12-20 16:03:40 -08:00
Paul Gauthier (aider)
e3ac8ab19d feat: Add --stats-languages option to filter results 2024-12-20 16:03:19 -08:00
Paul Gauthier
bddf6e9017 fix: Handle missing attributes in show_stats and empty models 2024-12-20 16:03:19 -08:00
Paul Gauthier
521841b447 fix: Skip redoing tests if results exist 2024-12-19 16:25:54 -08:00
Paul Gauthier (aider)
c53cd336f9 style: Fix linting issues 2024-12-19 15:59:03 -08:00
Paul Gauthier (aider)
a8226989c8 feat: Remove @Disabled annotations from Java test files 2024-12-19 15:58:59 -08:00
Paul Gauthier
114b156d74 fix: Use relative paths for ignored files, remove redundant try 2024-12-19 15:56:16 -08:00
Paul Gauthier (aider)
370b45bb35 feat: Ignore files in .meta and .docs directories 2024-12-19 07:23:28 -08:00
Paul Gauthier
616c4a9a53 chore: Add comment about ignoring meta and docs files 2024-12-19 07:23:27 -08:00
Paul Gauthier
821f7d6694 fix: Use extra_body for reasoning_effort, fix test counts 2024-12-19 07:10:20 -08:00
Paul Gauthier
c36c06ab99 fix: Retry tests on parse or timeout, add gpt-4o params 2024-12-18 15:56:38 -08:00
Paul Gauthier
a915c60999 feat: Add pass_num to benchmark results, fix hard set percent 2024-12-18 13:36:37 -08:00
Paul Gauthier
2aa4615c78 feat: Add openrouter/openai/o1 model and update prompts 2024-12-18 06:59:14 -08:00
Paul Gauthier (aider)
7dd1346878 fix: Remove stray ] causing syntax error 2024-12-17 20:34:33 -08:00
Paul Gauthier (aider)
31f8c7d9cb fix: Handle JSON decode errors when loading results 2024-12-17 20:34:21 -08:00
Paul Gauthier
914ce0b94d feat: Add total_tests to summary, handle JSON decode errors 2024-12-17 20:34:20 -08:00