Paul Gauthier 2025-05-07 12:59:57 -07:00
parent 3b08327792
commit c1dc473ed8
2 changed files with 43 additions and 28 deletions

@ -831,7 +831,7 @@
date: 2025-04-12 date: 2025-04-12
versions: 0.81.3.dev versions: 0.81.3.dev
seconds_per_case: 45.3 seconds_per_case: 45.3
total_cost: 6.3174 total_cost: 0 # incorrect: 6.3174
- dirname: 2025-03-29-05-24-55--chatgpt4o-mar28-diff - dirname: 2025-03-29-05-24-55--chatgpt4o-mar28-diff
test_cases: 225 test_cases: 225

@ -1,66 +1,81 @@
--- ---
title: Gemini 2.5 Pro Preview 0325 benchmark pricing title: Gemini 2.5 Pro Preview 03-25 benchmark cost
excerpt: The low price reported for Gemini 2.5 Pro Preview 0325 appears to be correct. excerpt: The $6.32 benchmark cost reported for Gemini 2.5 Pro Preview 03-25 was incorrect.
draft: false draft: true
nav_exclude: true nav_exclude: true
--- ---
{% if page.date %} {% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p> <p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %} {% endif %}
# Gemini 2.5 Pro Preview 0325 benchmark pricing # Gemini 2.5 Pro Preview 03-25 benchmark pricing
The $6.32 cost reported in the leaderboard to run the aider polyglot benchmark on The $6.32 cost reported to run the aider polyglot benchmark on
Gemini 2.5 Pro Preview 0325 was incorrect. Gemini 2.5 Pro Preview 03-25 was incorrect.
The true cost was higher, possibly significantly so. The true cost was higher, possibly significantly so.
This note shares the results of an audit and root cause analysis
relating to this error.
This note reviews and audits the original 0325 benchmark results to investigate the reported cost.
Two possible causes were identified, both related to the litellm package that Two possible causes were identified, both related to the litellm package that
aider uses to connect to LLM APIs. aider uses to connect to LLM APIs:
- The litellm model database had an incorrect price-per-token for output tokens in their database at the time of the benchmark. This does not appear to be a contributing factor to the incorrect benchmark cost. - The litellm model database had an incorrect price-per-token for Gemini 2.5 Pro Preview 03-25 in their costs database.
- The litellm package was incorrectly excluding reasoning tokens from the token counts it reported back to aider. This appears to be the cause of the incorrect benchmark cost. This does not appear to be a contributing factor to the incorrect benchmark cost.
- The litellm package was incorrectly excluding reasoning tokens from the token counts it reported to aider. This appears to be the cause of the incorrect benchmark cost.
The incorrect litellm database entry does not appear to have affected the aider benchmark costs. The incorrect litellm database entry does not appear to have affected the aider benchmark costs.
Aider maintains and uses its own database of costs for some models, and it contained Aider maintains and uses its own database of costs for some models, and it contained
the correct pricing at the time of the benchmark. the correct pricing at the time of the benchmark.
Aider appears to have Aider appears to have
loaded the correct cost data from its database and made use of it during the benchmark. loaded the correct cost data from its database and made use of it during the benchmark.
Since litellm appears to have been excluding reasoning tokens from the token counts it reported,
aider underestimated the API costs.
Litellm fixed this issue on April 21, 2025 in The version of litellm available at that time appears to have been
excluding reasoning tokens from the token counts it reported.
So even though aider had correct per-token pricing, it did not have the correct token counts
used during the benchmark.
This resulted in an underestimate of the benchmark costs.
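The effect described above can be illustrated with a minimal sketch. It assumes that reasoning tokens are billed at the output-token rate and that cost is simply tokens multiplied by the per-token price. The `Usage` class, the `estimate_cost` function, the token counts, and the input price are all hypothetical and chosen only for illustration; only the `0.000010` output price comes from this note. This is not aider's or litellm's actual accounting code.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical token usage for one API call (illustrative only)."""
    prompt_tokens: int
    completion_tokens: int  # visible output tokens
    reasoning_tokens: int   # hidden "thinking" tokens, assumed billed as output


# Output price taken from this note; input price is an assumed placeholder.
OUTPUT_COST_PER_TOKEN = 0.000010
INPUT_COST_PER_TOKEN = 0.00000125  # assumption, not from this note


def estimate_cost(usage: Usage, include_reasoning: bool) -> float:
    """Return the estimated cost in dollars for one call."""
    output_tokens = usage.completion_tokens
    if include_reasoning:
        output_tokens += usage.reasoning_tokens
    return (usage.prompt_tokens * INPUT_COST_PER_TOKEN
            + output_tokens * OUTPUT_COST_PER_TOKEN)


# Made-up counts: a call where most output is reasoning.
usage = Usage(prompt_tokens=1000, completion_tokens=200, reasoning_tokens=1800)

reported = estimate_cost(usage, include_reasoning=False)  # reasoning tokens dropped
actual = estimate_cost(usage, include_reasoning=True)     # reasoning tokens counted

print(f"reported: ${reported:.5f}  actual: ${actual:.5f}")
```

With correct per-token prices in both cases, the estimate that omits reasoning tokens is still lower, and the gap grows with the share of output spent on reasoning.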
Litellm fixed the token counting issue on April 21, 2025 in
commit [a7db0df](https://github.com/BerriAI/litellm/commit/a7db0df0434bfbac2b68ebe1c343b77955becb4b). commit [a7db0df](https://github.com/BerriAI/litellm/commit/a7db0df0434bfbac2b68ebe1c343b77955becb4b).
This fix was released in litellm v1.67.1. This fix was released in litellm v1.67.1.
Aider picked up this fix April 28, 2025 when it upgraded its litellm dependency Aider picked up this fix April 28, 2025 when it upgraded its litellm dependency
from v1.65.7 to v1.67.4.post1 from v1.65.7 to v1.67.4.post1
in commit [9351f37](https://github.com/Aider-AI/aider/commit/9351f37) in commit [9351f37](https://github.com/Aider-AI/aider/commit/9351f37).
That change shipped on May 5, 2025 in aider v0.82.3. That dependency change shipped on May 5, 2025 in aider v0.82.3.
Unfortunately, The incorrect cost has been removed from the leaderboard.
Unfortunately, the 03-25 version of Gemini 2.5 Pro Preview is no longer available,
so it is not possible to re-run the benchmark to obtain an accurate cost.
As a possibly relevant comparison, the newer 05-06 version of Gemini 2.5 Pro Preview
completed the benchmark at a cost of $41.17.
# Investigation # Investigation
Every aider benchmark report contains the git commit hash of the aider repo state used to Every aider benchmark report contains the git commit hash of the aider repository state used to
run the benchmark. run the benchmark.
The benchmark run in question was built from The
[benchmark run in question](https://github.com/Aider-AI/aider/blob/edbfec0ce4e1fe86735c915cb425b0d8636edc32/aider/website/_data/polyglot_leaderboard.yml#L814)
was built from
commit [0282574](https://github.com/Aider-AI/aider/commit/0282574). commit [0282574](https://github.com/Aider-AI/aider/commit/0282574).
Additional runs of the benchmark from that build verified that the error in litellm's Additional runs of the benchmark from that build verified that the error in litellm's
model cost database appears not to have been a factor: model cost database appears not to have been a factor:
- The local model database correctly overrides the litellm database, which contained an incorrect token cost at the time. - Aider's local model database correctly overrides the litellm database, which contained an incorrect token cost at the time.
- The correct pricing is loaded from aider's local model database and produces similar costs as the original run. - The correct pricing is loaded from aider's local model database and produces similar (incorrect) costs as the original run.
- Updating aider's local model database with an absurdly high token cost resulted in an appropriately high benchmark cost report. - Updating aider's local model database with an absurdly high token cost resulted in an appropriately high benchmark cost report, demonstrating that the local database costs were in effect.
This specific build of aider was then updated with various versions of litellm using `git bisect`
to identify the first litellm commit where correct token counts were returned.
That build of aider was updated with various versions of litellm using `git bisect`
to identify the litellm commit where the reasoning tokens were added to litellm's
token count reporting.
# Timeline # Timeline
Below is the full timeline of git commits related to this issue in the aider and litellm repositories. Below is the full timeline of git commits related to this issue in the aider and litellm repositories.
Each entry has a UTC timestamp, followed by the original literal timestamp obtained from the
relevant source.
- 2025-04-04 19:54:45 UTC (Sat Apr 5 08:54:45 2025 +1300) - 2025-04-04 19:54:45 UTC (Sat Apr 5 08:54:45 2025 +1300)
- Correct value `"output_cost_per_token": 0.000010` for `gemini/gemini-2.5-pro-preview-03-25` added to `aider/resources/model-metadata.json` - Correct value `"output_cost_per_token": 0.000010` for `gemini/gemini-2.5-pro-preview-03-25` added to `aider/resources/model-metadata.json`
@ -75,9 +90,9 @@ Below is the full timeline of git commits related to this issue in the aider and
- Commit [ac4f32f](https://github.com/BerriAI/litellm/commit/ac4f32f) in litellm. - Commit [ac4f32f](https://github.com/BerriAI/litellm/commit/ac4f32f) in litellm.
- 2025-04-12 04:55:50 UTC (2025-04-12-04-55-50 UTC) - 2025-04-12 04:55:50 UTC (2025-04-12-04-55-50 UTC)
- Benchmark performed - Benchmark performed.
- Aider repo hash [0282574 recorded in benchmark results](https://github.com/Aider-AI/aider/blob/7fbeafa1cfd4ad83f7499417837cdfa6b16fe7a1/aider/website/_data/polyglot_leaderboard.yml#L814), without a "dirty" annotation, indicating that the benchmark was run on a clean checkout of the aider repo at commit [0282574](https://github.com/Aider-AI/aider/commit/0282574). - Aider repo hash [0282574 recorded in benchmark results](https://github.com/Aider-AI/aider/blob/7fbeafa1cfd4ad83f7499417837cdfa6b16fe7a1/aider/website/_data/polyglot_leaderboard.yml#L814), without a "dirty" annotation, indicating that the benchmark was run on a clean checkout of the aider repo at commit [0282574](https://github.com/Aider-AI/aider/commit/0282574).
- Correct value `"output_cost_per_token": 0.000010` is in `aider/resources/model-metadata.json` at this commit [0282574](https://github.com/Aider-AI/aider/blob/0282574/aider/resources/model-metadata.json#L357) - Correct value `"output_cost_per_token": 0.000010` is in `aider/resources/model-metadata.json` at this commit [0282574](https://github.com/Aider-AI/aider/blob/0282574/aider/resources/model-metadata.json#L357).
- 2025-04-12 15:06:39 UTC (Apr 12 08:06:39 2025 -0700) - 2025-04-12 15:06:39 UTC (Apr 12 08:06:39 2025 -0700)
- Benchmark results added to aider repo. - Benchmark results added to aider repo.
@ -95,4 +110,4 @@ Below is the full timeline of git commits related to this issue in the aider and
- 2025-04-28 14:53:20 UTC (Mon Apr 28 07:53:20 2025 -0700) - 2025-04-28 14:53:20 UTC (Mon Apr 28 07:53:20 2025 -0700)
- Aider upgraded its litellm dependency from v1.65.7 to v1.67.4.post1, which included the reasoning token count fix. - Aider upgraded its litellm dependency from v1.65.7 to v1.67.4.post1, which included the reasoning token count fix.
- Commit [9351f37](https://github.com/Aider-AI/aider/commit/9351f37) in aider. - Commit [9351f37](https://github.com/Aider-AI/aider/commit/9351f37) in aider.
- This change shipped on May 5, 2025 in aider v0.82.3. - This dependency change shipped on May 5, 2025 in aider v0.82.3.
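
The timeline entries above pair a UTC timestamp with the literal timestamp from the source. As a rough sketch of how such a conversion can be checked, the snippet below parses a timestamp in git's default date layout (weekday, month, day, time, year, offset) and converts it to UTC. The helper name `to_utc` is hypothetical, and the sketch assumes an English locale for the weekday and month names.

```python
from datetime import datetime, timezone


def to_utc(git_timestamp: str) -> str:
    """Convert a git-style timestamp with a numeric offset to a UTC string."""
    # Example input layout: "Sat Apr 5 08:54:45 2025 +1300"
    local = datetime.strptime(git_timestamp, "%a %b %d %H:%M:%S %Y %z")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


# Two literal timestamps taken from the timeline.
print(to_utc("Sat Apr 5 08:54:45 2025 +1300"))
print(to_utc("Mon Apr 28 07:53:20 2025 -0700"))
```

Entries whose literal timestamp uses a different layout, such as the benchmark directory name, would need their own format string.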