Paul Gauthier 2024-09-12 15:17:32 -07:00
parent 84b1c1031a
commit 291d3509eb
2 changed files with 9 additions and 9 deletions

@@ -1,6 +1,6 @@
 - dirname: 2024-07-18-18-57-46--gpt-4o-mini-whole
   test_cases: 133
-  model: gpt-4o-mini
+  model: gpt-4o-mini (whole)
   edit_format: whole
   commit_hash: d31eef3-dirty
   pass_rate_1: 40.6
@@ -24,7 +24,7 @@
 - dirname: 2024-07-04-14-32-08--claude-3.5-sonnet-diff-continue
   test_cases: 133
-  model: claude-3.5-sonnet
+  model: claude-3.5-sonnet (diff)
   edit_format: diff
   commit_hash: 35f21b5
   pass_rate_1: 57.1
@@ -48,7 +48,7 @@
 - dirname: 2024-08-06-18-28-39--gpt-4o-2024-08-06-diff-again
   test_cases: 133
-  model: gpt-4o-2024-08-06
+  model: gpt-4o-2024-08-06 (diff)
   edit_format: diff
   commit_hash: ed9ed89
   pass_rate_1: 57.1
@@ -72,7 +72,7 @@
 - dirname: 2024-09-12-19-57-35--o1-mini-whole
   test_cases: 133
-  model: o1-mini
+  model: o1-mini (whole)
   edit_format: whole
   commit_hash: 36fa773-dirty, 291b456
   pass_rate_1: 49.6

@@ -9,9 +9,9 @@ nav_exclude: true
 # Benchmark results for OpenAI o1-mini
-OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet.
-o1-mini scored below those models
-when using the simple "whole" editing format.
+OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
+but scored below those models
+when using the "whole" editing format.
 It was close enough to GPT-4o to be within the margin of error.
 The o1-mini model had trouble following the very simple whole editing format.
@@ -21,8 +21,8 @@ the response format.
 Note that o1-mini's "whole" score is compared against GPT-4o and Sonnet
 "diff" results.
-Using diff is more challenging for GPT-4o and Sonnet,
-but it allows them to return search/replace blocks to
+Using diff is more challenging,
+but allows the model to return search/replace blocks to
 efficiently edit the source code.
 The whole format requires the o1-mini to return a fresh copy of the entire file,
 increasing costs and latency.
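
The trade-off described in the changed text, between returning search/replace blocks and returning a fresh copy of the whole file, can be illustrated with a minimal sketch. This is not the benchmark's or the tool's actual implementation: the function names `apply_whole` and `apply_search_replace`, the 200-line sample file, and the use of character counts as a rough stand-in for cost are all assumptions made for illustration.

```python
def apply_whole(original: str, new_file: str) -> str:
    """'whole' style: the reply is the complete new file and replaces the old one."""
    return new_file


def apply_search_replace(original: str, search: str, replace: str) -> str:
    """'diff' style: swap a single exact occurrence of `search` for `replace`."""
    if original.count(search) != 1:
        # The reply must reproduce existing text exactly, which is what
        # makes this format harder for a model to follow.
        raise ValueError("search text must match exactly once")
    return original.replace(search, replace, 1)


# Hypothetical 200-line source file with a one-line edit.
original = "".join(f"line {i}\n" for i in range(1, 201))
search = "line 100\n"
replace = "line 100 (edited)\n"

edited = apply_search_replace(original, search, replace)

# Rough proxy for reply size: characters the model has to emit.
whole_reply_chars = len(edited)                 # the entire file
diff_reply_chars = len(search) + len(replace)   # only the changed region

print(whole_reply_chars, diff_reply_chars)
```

In this sketch the search/replace reply stays small regardless of file size, while the whole-file reply grows with the file, which is the cost and latency effect the text refers to.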