Paul Gauthier 2024-08-26 21:18:44 -07:00
parent 4b82277ef7
commit 008b1cb5f7

@@ -11,12 +11,9 @@ highlight_image: /assets/sonnet-seems-fine.jpg
 Recently there has been a lot of speculation that Sonnet has been
 dumbed-down, nerfed or is otherwise performing worse.
-Sonnet seems as good as ever, at least when accessed via
-the API.
-As part of developing aider, I benchmark the top LLMs regularly.
-I have not noticed
-any degradation in Claude 3.5 Sonnet's performance at code editing.
+Sonnet seems as good as ever, when performing the
+[aider code editing benchmark](/docs/benchmarks.html#the-benchmark)
+via the API.
 Below is a graph showing the performance of Claude 3.5 Sonnet over time.
 It shows every clean, comparable benchmark run performed since Sonnet launched.
@@ -28,6 +25,9 @@ degradation.
 There is always some variance in benchmark results, typically +/- 2%
 between runs with identical prompts.
+It's worth noting that these results would not capture any changes
+made to how Sonnet is presented in Anthropic's web chat UI.
 <div class="chart-container" style="position: relative; height:400px; width:100%">
 <canvas id="sonnetPerformanceChart"></canvas>
 </div>
@@ -136,10 +136,10 @@ document.addEventListener('DOMContentLoaded', function() {
 });
 </script>
-This graph shows the performance of Claude 3.5 Sonnet on the
+> This graph shows the performance of Claude 3.5 Sonnet on the
 [Aider's code editing benchmark](/docs/benchmarks.html#the-benchmark)
-over time. 'Pass Rate 1' represents the initial success rate, while 'Pass Rate 2' shows the success rate after a second attempt with a chance to fix testing errors.
+> over time. 'Pass Rate 1' represents the initial success rate, while 'Pass Rate 2' shows the success rate after a second attempt with a chance to fix testing errors.
-The
-[aider LLM code editing leaderboard](https://aider.chat/docs/leaderboards/)
-ranks models based on Pass Rate 2.
+> The
+> [aider LLM code editing leaderboard](https://aider.chat/docs/leaderboards/)
+> ranks models based on Pass Rate 2.
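
The two pass rates described in the caption, and the roughly +/- 2% run-to-run variance mentioned earlier, can be illustrated with a small sketch. This is not aider's benchmark code: the `ExerciseResult` record, the `pass_rates` and `within_noise` helpers, and the sample data are all hypothetical, and the sketch assumes Pass Rate 2 is cumulative (an exercise solved on the first attempt also counts toward it).

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    # Hypothetical record for one benchmark exercise; not aider's actual schema.
    passed_first_try: bool
    passed_second_try: bool = False  # only consulted when the first try failed


def pass_rates(results):
    """Return (pass_rate_1, pass_rate_2) as percentages of all exercises."""
    if not results:
        raise ValueError("need at least one result")
    total = len(results)
    first = sum(r.passed_first_try for r in results)
    # Assumption: Pass Rate 2 is cumulative, so first-try passes count too.
    by_second = sum(r.passed_first_try or r.passed_second_try for r in results)
    return 100.0 * first / total, 100.0 * by_second / total


def within_noise(rate_a, rate_b, tolerance=2.0):
    """True if two pass rates differ by no more than the stated run-to-run variance."""
    return abs(rate_a - rate_b) <= tolerance


if __name__ == "__main__":
    # Made-up sample: two first-try passes, one pass on retry, one failure.
    sample = [
        ExerciseResult(True),
        ExerciseResult(False, True),
        ExerciseResult(False, False),
        ExerciseResult(True),
    ]
    rate_1, rate_2 = pass_rates(sample)
    print(f"Pass Rate 1: {rate_1:.1f}%  Pass Rate 2: {rate_2:.1f}%")
```

Under these assumptions, two runs whose Pass Rate 2 values differ by less than the tolerance would be treated as equivalent rather than as evidence of degradation.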