
---
parent: Aider LLM Leaderboards
highlight_image: /assets/leaderboard.jpg
nav_order: 100
description: Quantitative benchmark of LLM code refactoring skill.
---

# Refactoring leaderboard

Aider's refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. This is a more challenging benchmark, testing the model's ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo's "lazy coding" habit.

The refactoring benchmark requires a large context window to work with large source files. Therefore, results are available for fewer models.

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Percent completed correctly</th>
      <th>Percent using correct edit format</th>
      <th>Command</th>
      <th>Edit format</th>
    </tr>
  </thead>
  <tbody>
    {% assign refac_sorted = site.data.refactor_leaderboard | sort: 'pass_rate_1' | reverse %}
    {% for row in refac_sorted %}
      <tr>
        <td>{{ row.model }}</td>
        <td>{{ row.pass_rate_1 }}%</td>
        <td>{{ row.percent_cases_well_formed }}%</td>
        <td><code>{{ row.command }}</code></td>
        <td>{{ row.edit_format }}</td>
      </tr>
    {% endfor %}
  </tbody>
</table>
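
The table above is populated from the site's `_data/refactor_leaderboard.yml` data file. As a rough sketch of the shape that file takes, each entry would carry the fields the Liquid template reads; the field names below come from the template, but the values are purely hypothetical placeholders, not real benchmark results.

```yaml
# Illustrative entry only: field names match the Liquid template above,
# values are hypothetical and do not reflect any actual benchmark run.
- model: Example LLM v1
  pass_rate_1: 50.0                 # percent completed correctly
  percent_cases_well_formed: 95.0   # percent using the correct edit format
  command: aider --model example/example-llm-v1
  edit_format: udiff
```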