Paul Gauthier
412b8e7c3c
copy
2024-09-21 10:09:26 -07:00
Paul Gauthier
2753ac6b62
feat: Add new benchmark test case for qwen-2.5-72b-instruct-diff model
2024-09-20 13:27:58 -07:00
Paul Gauthier
8cb83afcc4
ask transient whole, o1-preview deep
2024-09-12 17:21:35 -07:00
Paul Gauthier
83662b7470
Merge branch 'main' into ask-plan-simple
2024-09-12 17:19:14 -07:00
Paul Gauthier
1fbb5079d5
unhack o1 mini
2024-09-12 15:38:28 -07:00
Paul Gauthier
291b456a45
hack for o1-mini: no system prompt, no temperature
2024-09-12 13:05:25 -07:00
Paul Gauthier
5408dcb185
wip
2024-09-11 09:32:14 -07:00
Paul Gauthier
39ae106bb3
wip
2024-09-10 15:21:54 -07:00
Paul Gauthier
abd484bfa7
wip
2024-09-06 12:01:51 -07:00
Paul Gauthier
cc15909629
clean diff edit format
2024-09-06 11:25:20 -07:00
Paul Gauthier
5b584db90c
sonnet-sonnet gets 60.2/84.2
2024-09-06 09:49:01 -07:00
Paul Gauthier
1c73e7d43a
turn off suggest shell commands during benchmarks
2024-09-05 14:35:34 -07:00
Paul Gauthier
05dcbeecac
noop
2024-09-05 14:25:09 -07:00
Paul Gauthier
ff3a75413b
sonnet+deep got 60.9/82.0
2024-09-05 13:30:25 -07:00
Paul Gauthier
1a3d8c4015
wip
2024-08-20 17:45:40 -07:00
Paul Gauthier
b61b5f4b74
cleanup before merge
2024-08-16 11:35:30 -07:00
Paul Gauthier
bac04a2a3d
no lint
2024-08-15 06:10:46 -07:00
Paul Gauthier
060c8ff89a
override dotenv
2024-08-13 18:06:00 -07:00
Paul Gauthier
139f7992cb
do not pass pretty to coder
2024-08-13 17:43:41 -07:00
Paul Gauthier
ca18220b77
num_with_malformed_responses
2024-05-19 14:19:06 -07:00
Paul Gauthier
70b1c0c20c
load .env in benchmark.py
2024-05-07 13:32:19 -07:00
Paul Gauthier
ecca737803
added deepseek-chat v2
2024-05-07 06:26:39 -07:00
Paul Gauthier
b1cae73a85
cleaned up csv output
2024-05-07 05:59:31 -07:00
Paul Gauthier
a7b08c7354
format output as yaml
2024-05-06 11:15:19 -07:00
Paul Gauthier
3162d42262
cleanup
2024-05-06 10:46:09 -07:00
Paul Gauthier
5fb7a323ec
refactored plots
2024-05-06 10:44:34 -07:00
Paul Gauthier
3bb237bdc1
handle tasks with exceptions in the stats output
2024-05-05 08:24:45 -07:00
Paul Gauthier
9cdd9e12c3
catch all exceptions in the benchmark
2024-05-04 17:52:46 -07:00
Paul Gauthier
9b88f8caf6
updated gpt-4-0314
2024-05-04 07:59:27 -07:00
Paul Gauthier
f6580fff76
updated all openai models
2024-05-04 07:38:50 -07:00
Paul Gauthier
1981105932
aider: Implemented the TODO to extract the version from aider/__init__.py for each commit hash.
...
# Aider chat conversation:
USER: do the todo
ASSISTANT: Here is the *SEARCH/REPLACE block* to do the TODO in the code:
benchmark/benchmark.py
<source>python
<<<<<<< SEARCH
for hsh in variants['commit_hash']):
# TODO: get the output from `git show {hsh}:aider/__init__.py`
=======
for hsh in variants['commit_hash']:
    try:
        version = subprocess.check_output(
            ['git', 'show', f'{hsh}:aider/__init__.py'],
            universal_newlines=True
        )
        version = re.search(r'__version__ = "(.*)"', version).group(1)
        csv.append(version)
    except subprocess.CalledProcessError:
        csv.append('unknown')
>>>>>>> REPLACE
</source>
This change:
1. Loops through each commit hash in `variants['commit_hash']`
2. Uses `subprocess.check_output` to run the `git show` command and capture the contents of `aider/__init__.py` at that commit
3. Searches the file contents for the `__version__ = "X.Y.Z"` line and extracts the version string
4. Appends the version to the `csv` list, or `'unknown'` if there was an error getting the file contents
The `try/except` handles cases where the commit hash doesn't exist or `aider/__init__.py` is missing.
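The version lookup described in this commit message can be sketched as a standalone helper. This is a minimal illustration, not the code in `benchmark/benchmark.py`: the function names `parse_version` and `version_at_commit` are hypothetical, and the sketch assumes `git` is on the `PATH` and that it is run inside the repository. Unlike the block above, it also falls back to `'unknown'` when the file exists but contains no `__version__` line.

```python
import re
import subprocess

# Same pattern as in the commit message: captures the text between the quotes.
VERSION_RE = re.compile(r'__version__ = "(.*)"')


def parse_version(init_py_text):
    """Return the version string found in the text, or 'unknown' if absent."""
    match = VERSION_RE.search(init_py_text)
    return match.group(1) if match else "unknown"


def version_at_commit(hsh, path="aider/__init__.py"):
    """Read `path` as of commit `hsh` via `git show` and extract the version.

    Hypothetical helper: returns 'unknown' if the commit or file is missing,
    or if git itself cannot be run.
    """
    try:
        text = subprocess.check_output(
            ["git", "show", f"{hsh}:{path}"],
            universal_newlines=True,
            stderr=subprocess.DEVNULL,  # keep git's error text off the console
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"
    return parse_version(text)
```

Splitting the parsing from the `git show` call keeps the regular expression testable without a repository, for example `parse_version('__version__ = "1.2.3"')`.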
2024-05-04 07:14:23 -07:00
Paul Gauthier
01282674d4
Add pass rates to CSV output in benchmark results summary.
2024-05-04 07:13:40 -07:00
Paul Gauthier
4461c7c4b2
fixed benchmark
2024-04-23 09:44:04 -07:00
Paul Gauthier
fd5b9bbfcb
Added groq llama3
2024-04-22 07:12:01 -07:00
Paul Gauthier
434fa5f6a7
updated benchmark to new Coder & Model classes
2024-04-19 15:21:24 -07:00
Paul Gauthier
7875418183
fix column order
2024-04-09 18:11:08 -07:00
Paul Gauthier
00f1cdb561
Added gpt-4-turbo vision blog post
2024-04-09 16:55:35 -07:00
Paul Gauthier
ac39791fee
fixed mislabelled gpt-4 column
2024-03-09 08:20:27 -08:00
Paul Gauthier
f5887a5098
tweaking graph labels
2024-02-03 08:25:19 -08:00
Paul Gauthier
9033be74bf
Initial benchmark results for 0125
2024-01-25 13:00:16 -08:00
Joshua Vial
93f32d3855
make benchmark listen to openai_api_base env var
2023-12-21 09:38:54 +13:00
Joshua Vial
9e656945fe
Merge remote-tracking branch 'upstream/main' into gpt4-vision
2023-12-21 09:29:32 +13:00
Joshua Vial
d4e663f7bc
benchmark work with openrouter
2023-12-20 10:27:33 +13:00
Paul Gauthier
755b3858eb
copy
2023-12-19 11:11:58 -08:00
Paul Gauthier
e3c8fac604
copy
2023-12-18 10:20:40 -08:00
Paul Gauthier
b0c03820e9
copy
2023-12-18 10:19:38 -08:00
Paul Gauthier
16534e914b
better graph
2023-12-18 10:02:52 -08:00
Paul Gauthier
6ab2db192c
Added udiff graph
2023-12-18 09:53:28 -08:00
Paul Gauthier
7113a30271
unified diffs
2023-12-17 12:54:34 -08:00
Paul Gauthier
cab7460f94
catch 404s from azure on models.list
2023-12-07 07:44:21 -08:00