Paul Gauthier (aider) | a9401e921e | feat: add sleep option between tests in single-threaded mode | 2024-12-11 13:09:45 -08:00
Paul Gauthier (aider) | 6af71951af | style: fix whitespace in benchmark.py | 2024-11-28 14:01:50 -08:00
Paul Gauthier (aider) | 3eed45dc3e | fix: improve benchmark directory selection based on latest .md file timestamp | 2024-11-28 14:01:45 -08:00
Paul Gauthier (aider) | 320b059bc7 | perf: optimize benchmark dir search by filtering on timestamp first | 2024-11-28 14:00:12 -08:00
Paul Gauthier | a89ce06377 | fix: correct glob pattern for finding latest benchmark directory | 2024-11-28 14:00:10 -08:00
Paul Gauthier (aider) | 2ff3a23606 | fix: add num_ctx parameter to run_test_real function | 2024-11-25 19:21:08 -08:00
Paul Gauthier (aider) | c5ce57ea7f | style: fix linting issues in benchmark.py | 2024-11-25 19:20:49 -08:00
Paul Gauthier (aider) | 351b8e50f0 | feat: add --num-ctx flag to override model context window size | 2024-11-25 19:20:43 -08:00
fry69 | 667a58052e | feat: change edit format from "senior" to "architect" | 2024-09-27 09:03:42 +02:00
fry69 | e3e0d57512 | chore: update parameter names in args and benchmark | 2024-09-27 08:57:22 +02:00
Paul Gauthier | eb21cf2830 | architect/editor | 2024-09-26 16:10:19 -07:00
Paul Gauthier (aider) | 5a78e7d1b8 | chore: Run the linter | 2024-09-26 11:35:13 -07:00
Paul Gauthier (aider) | 1c05192b69 | fix: Only record junior_model and junior_edit_format in the results array if edit_format is "senior" | 2024-09-26 11:35:09 -07:00
Paul Gauthier | e682eb8669 | fix: Add junior model and junior edit format to benchmark results | 2024-09-25 16:31:40 -07:00
Paul Gauthier (aider) | ed7503dbbe | feat: optimize find_latest_benchmark_dir to check only .md files and limit to one file per subtree | 2024-09-25 12:20:45 -07:00
Paul Gauthier (aider) | e21cdafb15 | style: run linter and fix code formatting issues | 2024-09-25 12:18:43 -07:00
Paul Gauthier (aider) | 8d90df1ebc | feat: implement automatic selection of the most recently updated benchmark directory when using --stats without dirnames | 2024-09-25 12:18:39 -07:00
Paul Gauthier (aider) | 24c959af2d | feat: Add --junior-model and --junior-edit-format flags to the benchmark | 2024-09-25 11:44:34 -07:00
Paul Gauthier | 15cc709322 | feat: Improve senior coder's edit format handling | 2024-09-25 11:42:09 -07:00
Paul Gauthier | 65e57df7ea | feat: Implement changes to handle files content in Coder and prompts | 2024-09-25 09:54:16 -07:00
Paul Gauthier | 075bc828f6 | stand alone junior message | 2024-09-25 08:41:49 -07:00
Paul Gauthier | c912982747 | senior-junior | 2024-09-25 08:25:11 -07:00
Paul Gauthier | a9e9f9cdbe | Merge branch 'main' into ask-plan-simple | 2024-09-25 07:46:15 -07:00
Paul Gauthier | 412b8e7c3c | copy | 2024-09-21 10:09:26 -07:00
Paul Gauthier | 2753ac6b62 | feat: Add new benchmark test case for qwen-2.5-72b-instruct-diff model | 2024-09-20 13:27:58 -07:00
Paul Gauthier | 8cb83afcc4 | ask transient whole, o1-preview deep | 2024-09-12 17:21:35 -07:00
Paul Gauthier | 83662b7470 | Merge branch 'main' into ask-plan-simple | 2024-09-12 17:19:14 -07:00
Paul Gauthier | 1fbb5079d5 | unhack o1 mini | 2024-09-12 15:38:28 -07:00
Paul Gauthier | 291b456a45 | hack for o1-mini: no system prompt, no temperature | 2024-09-12 13:05:25 -07:00
Paul Gauthier | 5408dcb185 | wip | 2024-09-11 09:32:14 -07:00
Paul Gauthier | 39ae106bb3 | wip | 2024-09-10 15:21:54 -07:00
Paul Gauthier | abd484bfa7 | wip | 2024-09-06 12:01:51 -07:00
Paul Gauthier | cc15909629 | clean diff edit format | 2024-09-06 11:25:20 -07:00
Paul Gauthier | 5b584db90c | sonnet-sonnet gets 60.2/84.2 | 2024-09-06 09:49:01 -07:00
Paul Gauthier | 1c73e7d43a | turn off suggest shell commands during benchmarks | 2024-09-05 14:35:34 -07:00
Paul Gauthier | 05dcbeecac | noop | 2024-09-05 14:25:09 -07:00
Paul Gauthier | ff3a75413b | sonnet+deep got 60.9/82.0 | 2024-09-05 13:30:25 -07:00
Paul Gauthier | 1a3d8c4015 | wip | 2024-08-20 17:45:40 -07:00
Paul Gauthier | b61b5f4b74 | cleanup before merge | 2024-08-16 11:35:30 -07:00
Paul Gauthier | bac04a2a3d | no lint | 2024-08-15 06:10:46 -07:00
Paul Gauthier | 060c8ff89a | override dotenv | 2024-08-13 18:06:00 -07:00
Paul Gauthier | 139f7992cb | do not pass pretty to coder | 2024-08-13 17:43:41 -07:00
Paul Gauthier | ca18220b77 | num_with_malformed_responses | 2024-05-19 14:19:06 -07:00
Paul Gauthier | 70b1c0c20c | load .env in benchmark.py | 2024-05-07 13:32:19 -07:00
Paul Gauthier | ecca737803 | added deepseek-chat v2 | 2024-05-07 06:26:39 -07:00
Paul Gauthier | b1cae73a85 | cleaned up csv output | 2024-05-07 05:59:31 -07:00
Paul Gauthier | a7b08c7354 | format output as yaml | 2024-05-06 11:15:19 -07:00
Paul Gauthier | 3162d42262 | cleanup | 2024-05-06 10:46:09 -07:00
Paul Gauthier | 5fb7a323ec | refactored plots | 2024-05-06 10:44:34 -07:00
Paul Gauthier | 3bb237bdc1 | handle tasks with exceptions in the stats output | 2024-05-05 08:24:45 -07:00