added together_ai/qwen/Qwen2-72B-Instruct data

This commit is contained in:
Paul Gauthier 2024-06-08 16:43:28 -07:00
parent 02c7335aa7
commit 86ea47f791
2 changed files with 23 additions and 14 deletions


@@ -474,4 +474,26 @@
  versions: 0.28.1-dev
  seconds_per_case: 17.6
  total_cost: 1.6205
- dirname: 2024-06-08-22-37-55--qwen2-72b-instruct-whole
  test_cases: 133
  model: Qwen2 72B Instruct
  edit_format: whole
  commit_hash: 02c7335-dirty, 1a97498-dirty
  pass_rate_1: 44.4
  pass_rate_2: 55.6
  percent_cases_well_formed: 100.0
  error_outputs: 3
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 3
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  command: aider --model together_ai/qwen/Qwen2-72B-Instruct
  date: 2024-06-08
  versions: 0.37.1-dev
  seconds_per_case: 14.3
  total_cost: 0.0000


@@ -15,19 +15,6 @@ The leaderboards below report the results from a number of popular LLMs.
While [aider can connect to almost any LLM](/docs/llms.html),
it works best with models that score well on the benchmarks.
## GPT-4o takes the #1 & #2 spots
GPT-4o tops the aider LLM code editing leaderboard at 72.9%, versus 68.4% for Opus. GPT-4o takes second on aider's refactoring leaderboard with 62.9%, versus Opus at 72.3%.
GPT-4o did much better than the 4-turbo models, and seems *much* less lazy.
GPT-4o is also able to use aider's established "diff" edit format that uses
`SEARCH/REPLACE` blocks.
This diff format is used by all the other capable models, including Opus and
the original GPT-4 models.
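For illustration, a SEARCH/REPLACE block in the "diff" edit format looks roughly like this (the file name and code here are hypothetical, not from the benchmark):

```
demo.py
<<<<<<< SEARCH
def greet():
    print("hello")
=======
def greet(name):
    print(f"hello, {name}")
>>>>>>> REPLACE
```

The model quotes a verbatim chunk of the existing file between the SEARCH and divider markers, and aider swaps in the REPLACE section, which discourages lazy placeholders like "# ... rest of code unchanged".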
The GPT-4 Turbo models have all required the "udiff" edit format, due to their
tendency toward lazy coding.
## Code editing leaderboard