copy

2025-06-02 18:54:59 +00:00 · 2024-07-25 11:07:00 +02:00 · 2024-07-25 11:07:00 +02:00 · 255828d959
commit 255828d959
parent 34d7edf1c4
2 changed files with 71 additions and 0 deletions
--- a/aider/website/_posts/2024-07-25-new-models.md
+++ b/aider/website/_posts/2024-07-25-new-models.md
@ -0,0 +1,71 @@
 ---
 title: Coding with Llama 3.1, new DeepSeek Coder & Mistral Large
 excerpt: Summary of code editing skill for the new models, with Sonnet and GPT-3.5 for scale.
 highlight_image: /assets/2024-07-new-models.jpg
 nav_exclude: true
 ---
 {% if page.date %}
 <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
 {% endif %}
 # Coding with Llama 3.1, new DeepSeek Coder & Mistral Large
 ![Summary of code editing skill for the new models, with Sonnet and GPT-3.5 for scale.](/assets/2024-07-new-models.jpg)
 Five noteworthy models have been released in the last few days,
 with a wide range of code editing capabilities.
 Here are their results from
 [aider's code editing leaderboard](https://aider.chat/docs/leaderboards/)
 with Claude 3.5 Sonnet and the best GPT-3.5 model
 included for scale.
 - **77% claude-3.5-sonnet**
 - 73% DeepSeek Coder V2 0724
 - 66% llama-3.1-405b-instruct
 - 60% Mistral Large 2 (2407)
 - 59% llama-3.1-70b-instruct
 - **58% gpt-3.5-turbo-0301**
 - 38% llama-3.1-8b-instruct
 ## DeepSeek Coder V2 0724
 DeepSeek Coder V2 0724 was by far the biggest surprise
 and strongest code editing model, coming in 2nd on the leaderboard.
 It can
 efficiently edit code with SEARCH/REPLACE, unlike the prior
 the prior DeepSeek Coder version.
 This unlocks the ability to edit large files. 
 This Coder got 73% on the benchmark,
 very
 close to Sonnet (77%) but 20-50X less expensive!
 ## LLama 3.1
 Meta released the family of Llama 3.1 models,
 which have performed well on many evals.
 The flagship Llama 3.1 405B instruct only 
 securing #7 on aider's leaderboard.
 This is well behind frontier models like
 Claude 3.5 Sonnet & GPT-4o. 
 The 405B model can use SEARCH/REPLACE to efficiently
 edit code, but at a significant decrease in the benchmark score.
 When using this "diff" editing format, its score dropped to 64%.
 The smaller 70B model was competitive with GPT-3.5, while
 the 8B model lags far behind.
 ## Mistral Large 2 (2407)
 Mistral Large 2 (2407) scored only 60% on aider's code editing
 benchmark. 
 This puts it just ahead of the best GPT-3.5 model. 
 It
 doesn't seem able to reliably use SEARCH/REPLACE to efficiently edit
 code,
 which limits its use to small source files.
--- a/aider/website/assets/2024-07-new-models.jpg
+++ b/aider/website/assets/2024-07-new-models.jpg