Paul Gauthier 2024-08-26 21:18:44 -07:00
parent 4b82277ef7
commit 008b1cb5f7

@@ -11,12 +11,9 @@ highlight_image: /assets/sonnet-seems-fine.jpg
Recently there has been a lot of speculation that Sonnet has been
dumbed-down, nerfed or is otherwise performing worse.
Sonnet seems as good as ever, at least when accessed via
the API.
As part of developing aider, I benchmark the top LLMs regularly.
I have not noticed
any degradation in Claude 3.5 Sonnet's performance at code editing.
Sonnet seems as good as ever, when performing the
[aider code editing benchmark](/docs/benchmarks.html#the-benchmark)
via the API.
Below is a graph showing the performance of Claude 3.5 Sonnet over time.
It shows every clean, comparable benchmark run performed since Sonnet launched.
@@ -28,6 +25,9 @@ degradation.
There is always some variance in benchmark results, typically +/- 2%
between runs with identical prompts.
It's worth noting that these results would not capture any changes
made to how Sonnet is presented in Anthropic's web chat UI.
<div class="chart-container" style="position: relative; height:400px; width:100%">
<canvas id="sonnetPerformanceChart"></canvas>
</div>
@@ -136,10 +136,10 @@ document.addEventListener('DOMContentLoaded', function() {
});
</script>
This graph shows the performance of Claude 3.5 Sonnet on the
> This graph shows the performance of Claude 3.5 Sonnet on the
[aider code editing benchmark](/docs/benchmarks.html#the-benchmark)
over time. 'Pass Rate 1' represents the initial success rate, while 'Pass Rate 2' shows the success rate after a second attempt with a chance to fix testing errors.
The
[aider LLM code editing leaderboard](https://aider.chat/docs/leaderboards/)
ranks models based on Pass Rate 2.
> over time. 'Pass Rate 1' represents the initial success rate, while 'Pass Rate 2' shows the success rate after a second attempt with a chance to fix testing errors.
> The
> [aider LLM code editing leaderboard](https://aider.chat/docs/leaderboards/)
> ranks models based on Pass Rate 2.
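
As a rough illustration of how the two pass rates described above relate, here is a minimal Python sketch. It assumes Pass Rate 2 is cumulative, counting exercises solved on either the first or the second attempt. The `ExerciseResult` record, the `pass_rates` helper, and the sample outcomes are hypothetical and made up for illustration; they are not aider's actual benchmark code or data.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Outcome of one benchmark exercise (hypothetical record layout)."""
    passed_first_try: bool
    passed_second_try: bool  # only consulted when the first try failed


def pass_rates(results):
    """Return (pass_rate_1, pass_rate_2) as percentages.

    Assumption: Pass Rate 2 counts exercises solved on the first
    attempt or on the second attempt, so it is never below Pass Rate 1.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    first = sum(r.passed_first_try for r in results)
    by_second = sum(r.passed_first_try or r.passed_second_try for r in results)
    return 100.0 * first / total, 100.0 * by_second / total


# Made-up outcomes for illustration only, not real benchmark data.
sample = [
    ExerciseResult(True, False),
    ExerciseResult(False, True),
    ExerciseResult(False, False),
    ExerciseResult(True, False),
]
rate_1, rate_2 = pass_rates(sample)
print(f"Pass Rate 1: {rate_1:.1f}%  Pass Rate 2: {rate_2:.1f}%")
```

Under this reading, a model that repairs many of its first-attempt failures shows a large gap between the two numbers, which may be why the leaderboard ranks on Pass Rate 2.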