Mirror of https://github.com/Aider-AI/aider.git

Commit 60c29b2839 ("copy"), parent 7972f5f4bc
2 changed files with 41 additions and 14 deletions
@@ -296,4 +296,27 @@
   date: 2024-11-24
   versions: 0.64.2.dev
   seconds_per_case: 28.5
   total_cost: 0.1390
+
+- dirname: 2024-11-26-03-15-06--ollama-qwen2.5-coder:32b-instruct-fp16-2kctx
+  test_cases: 132
+  model: "Ollama: fp16, 2k ctx"
+  edit_format: diff
+  commit_hash: 68be6c5-dirty, 554d274, 2ff3a23, 2ff3a23-dirty, 61759f9, dd48b74, 3ebd47d-dirty
+  pass_rate_1: 43.2
+  pass_rate_2: 51.9
+  percent_cases_well_formed: 46.2
+  error_outputs: 171
+  num_malformed_responses: 165
+  num_with_malformed_responses: 71
+  user_asks: 97
+  lazy_comments: 2
+  syntax_errors: 4
+  indentation_errors: 0
+  exhausted_context_windows: 0
+  test_timeouts: 0
+  command: "aider --model ollama/qwen2.5-coder:32b-instruct-fp16 # num_ctx: 2048"
+  date: 2024-11-26
+  versions: 0.64.2.dev,0.65.1.dev
+  seconds_per_case: 188.6
+  total_cost: 0.0000
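The `command:` field in the new entry notes `# num_ctx: 2048`, meaning this run deliberately left Ollama at its small 2k context window. As a rough illustration only (the settings file actually used for the benchmark is not part of this commit), a per-model settings entry pinning that window might look like:

```yaml
# Illustrative .aider.model.settings.yml entry -- not taken from this commit.
# Assumes values in extra_params are forwarded to the Ollama server,
# where num_ctx controls the context window size.
- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 2048  # the deliberately small 2k window recorded for this leaderboard entry
```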
@@ -30,26 +30,29 @@ served both locally and from a variety of cloud providers.
 - Results from individual providers served via OpenRouter and directly to their own APIs.
 - Ollama locally serving different quantizations from the [Ollama model library](https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M).
 
-The best versions of the model rival GPT-4o, while the worst performer
-is more like the older GPT-4 Turbo.
-Suboptimal choices in quantization and token limits can
-easily produce far worse results.
 
 This benchmarking effort highlighted a number of pitfalls and details which
 can have a significant impact on the model's ability to correctly edit code:
 
-- Quantization -- Open source models are often available at dozens of different quantizations.
-- Context window -- Cloud providers can decide how large a context window to accept,
+- **Quantization** -- Open source models are often available at dozens of different quantizations.
+- **Context window** -- Cloud providers can decide how large a context window to accept,
 and they often choose differently. Ollama defaults to a tiny 2k context window,
-and silently discards data that exceeds it.
-- Output token limits -- Open source models are often served with wildly
+and silently discards data that exceeds it. Such a small window has
+catastrophic effects on performance.
+- **Output token limits** -- Open source models are often served with wildly
 differing output token limits. This has a direct impact on how much code the
 model can write or edit in a response.
-- Buggy cloud providers -- Between Qwen and DeepSeep, there were
+- **Buggy cloud providers** -- Between Qwen 2.5 Coder 32B Instruct
+and DeepSeek V2.5, there were
 multiple cloud providers with broken or buggy API endpoints that seemed
 to be returning results different from what was expected based on the advertised
 quantization and context sizes.
 
+The best versions of the model rival GPT-4o, while the worst performing
+quantization is more like the older GPT-4 Turbo.
+Even an excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
+if run with Ollama's default 2k context window.
+
 
 ### Sections
 {: .no_toc }
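To make the **Quantization** and **Output token limits** bullets concrete, here is a hedged sketch of per-model overrides: the quantization is selected by the Ollama model tag, and `max_tokens` is assumed to be passed through to the backend via `extra_params` (the values are examples, not recommendations):

```yaml
# Illustrative .aider.model.settings.yml entry -- example values only.
- name: ollama/qwen2.5-coder:32b-instruct-q4_K_M  # quantization is chosen via the model tag
  extra_params:
    max_tokens: 4096  # assumed: caps how much code the model can write or edit per response
```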
@@ -134,9 +137,10 @@ a request that exceeds the context window.
 Instead, it just silently truncates the request by discarding the "oldest" messages
 in the chat to make it fit within the context window.
 
-All of the Ollama results above were collected with at least an 8k context window, which
-is large enough to attempt all the coding problems in the benchmark.
-Aider sets Ollama's context window to 8k by default.
+Except for the single 2k context result,
+all of the Ollama results above were collected with at least an 8k context window.
+An 8k window is large enough to attempt all the coding problems in the benchmark.
+Aider sets Ollama's context window to 8k by default, starting in aider v0.65.0.
 
 You can change the Ollama server's context window with a
 [`.aider.model.settings.yml` file](https://aider.chat/docs/config/adv-model-settings.html#model-settings)
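For reference, the kind of settings entry the linked docs describe could look like the sketch below; it assumes `num_ctx` supplied via `extra_params` is honored by the Ollama server (see the linked page for the authoritative format):

```yaml
# Sketch of an .aider.model.settings.yml entry that raises Ollama's context window.
- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192  # assumed: sets the Ollama server's context window for this model
```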