Paul Gauthier 2024-11-26 14:12:16 -08:00
parent 60c29b2839
commit ab3b50296c

@@ -34,6 +34,8 @@ This benchmarking effort highlighted a number of pitfalls and details which
can have a significant impact on the model's ability to correctly edit code:
- **Quantization** -- Open source models are often available at dozens of different quantizations.
Most seem to only modestly decrease code editing skill, but stronger quantizations
do have a real impact.
- **Context window** -- Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama defaults to a tiny 2k context window,
and silently discards data that exceeds it. Such a small window has
@@ -43,9 +45,12 @@ differing output token limits. This has a direct impact on how much code the
model can write or edit in a response.
- **Buggy cloud providers** -- Between Qwen 2.5 Coder 32B Instruct
and DeepSeek V2.5, there were
multiple cloud providers with broken or buggy API endpoints.
They seemed
to be returning results different from those expected based on the advertised
quantization and context sizes.
The harm caused to the code editing benchmark varied from serious
to catastrophic.
The best versions of the model rival GPT-4o, while the worst performing
quantization is more like the older GPT-4 Turbo.
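
The context window pitfall called out in the diff above is easy to guard against locally. The sketch below is one way to do it, assuming a local Ollama server on its default port (11434) and a locally pulled `qwen2.5-coder:32b` model (both assumptions, not part of the original post): it passes Ollama's `num_ctx` option explicitly instead of relying on the 2k-token default, which silently truncates longer prompts.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint


def chat(messages, model="qwen2.5-coder:32b", num_ctx=16384):
    """Send a non-streaming chat request to a local Ollama server,
    requesting an explicit context window instead of the 2k default."""
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,           # assumed model tag; use whatever is pulled locally
            "messages": messages,
            "stream": False,
            # Without num_ctx, Ollama falls back to its small default context
            # window and silently drops input that exceeds it.
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]


if __name__ == "__main__":
    reply = chat([{"role": "user", "content": "What does num_ctx control?"}])
    print(reply)
```

A benchmark harness built on this kind of wrapper avoids the silent-truncation failure mode; whether a given cloud provider honors an equivalent setting still has to be verified per provider.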