Mirror of https://github.com/Aider-AI/aider.git, synced 2025-06-01 10:14:59 +00:00

feat: Add support for using two models to complete coding tasks

Commit 222b9cff09 (parent 89aa385613), 1 changed file with 78 additions and 8 deletions

@@ -11,7 +11,82 @@ nav_exclude: true
# Separating code reasoning and editing

Here's a table containing the benchmark data for different model configurations:

Aider now has experimental support for using two models to complete each coding task:

- A Senior model is asked to describe how to solve the coding problem in detail.
- A Junior model is given the Senior's solution and asked to produce specific code editing instructions to apply those changes to source files.
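The two-step flow above can be sketched as follows. This is a minimal illustration with stub model functions; `senior_solve`, `junior_edit`, and `run_task` are hypothetical names for this sketch, not aider's internal API:

```python
# Minimal sketch of the Senior/Junior flow (illustrative only; these
# function names and prompts are hypothetical, not aider's internals).

def senior_solve(task, senior_model):
    """Ask the Senior model to describe how to solve the task in detail."""
    return senior_model(f"Describe in detail how to solve:\n{task}")

def junior_edit(solution, junior_model):
    """Ask the Junior model to turn the Senior's plan into concrete edits."""
    return junior_model(f"Produce code editing instructions applying:\n{solution}")

def run_task(task, senior_model, junior_model):
    # Chain the two models: reasoning first, then editing.
    return junior_edit(senior_solve(task, senior_model), junior_model)

# Stub "models" so the sketch runs without any API keys:
senior = lambda prompt: "Plan: rename foo() to bar() in utils.py"
junior = lambda prompt: "utils.py: replace 'def foo(' with 'def bar('"

print(run_task("rename foo to bar", senior, junior))
```

In aider itself the two roles are just model configurations, so any chat-capable model can fill either slot.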
Splitting up "code reasoning" and "code editing" has produced SOTA results on
[aider's code editing benchmark](/docs/benchmarks.html#the-benchmark).

## Motivation

This approach was motivated by OpenAI's recently released o1 models.
They are strong at reasoning, but often fail to output well-formed
code editing instructions.
It helps to pass their solutions to a more traditional LLM,
which can produce the specific code edits needed to update
the existing source code file.

Traditional frontier models like gpt-4o and Sonnet also
seem to benefit from separating code reasoning and editing.
It helps to use a pair of gpt-4o
or a pair of Sonnet models
in a Senior/Junior configuration.

The speed and costs of frontier models have been rapidly improving,
making it more attractive to chain a pair of modern models like this.
Chaining older LLMs would have been quite slow,
significantly harming aider's goal of providing a rapid, interactive,
pair programming AI coding experience.

## Results

The graph above and the table below show
[aider's code editing benchmark](/docs/benchmarks.html#the-benchmark)
scores for various combinations of Senior and Junior models.

Some noteworthy observations:

- o1-preview with Deepseek as the Junior surprises as the SOTA result, beating other stronger Junior models. This result is obtained with Deepseek using the "whole" editing format, requiring it to output a fully updated copy of each edited source file. This is quite slow, and so probably not practical for interactive use with aider.
- Pairing OpenAI's o1-preview with Anthropic's Sonnet as the Junior produces the second best result, and is an entirely practical configuration for users able to work with both providers.
- Pairing Sonnet+Sonnet and GPT-4o+GPT-4o provides significant lift for both models, especially for GPT-4o.
- Deepseek is surprisingly effective as a Junior model, responsible for turning proposed coding solutions into new, updated versions of the source files. Using the efficient "diff" editing format, Deepseek helps all the Senior models except for Sonnet.

## Related work

This approach is somewhat similar to
[Cursor's "Instant Apply"](https://fireworks.ai/blog/cursor) feature.
The main differences are:

- Aider can flexibly use any off-the-shelf model as the Junior.
- Aider's Junior model can use the efficient "diff" editing format to specify source code changes as a series of search/replace operations. Cursor's instant apply models essentially use the "whole" edit format, asking the model to output a full, updated copy of each edited source file.
- Cursor's apply model is highly optimized for speed and reaches 1,000 tokens/second, which mitigates the delays associated with outputting whole copies of edited files.
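As a rough illustration of the "diff" format's search/replace style of editing, here is a minimal sketch. The `apply_search_replace` helper is a hypothetical simplification for this post, not aider's actual implementation:

```python
# Illustrative sketch of applying one search/replace edit, in the spirit
# of aider's "diff" edit format (simplified; not aider's real code).

def apply_search_replace(source, search, replace):
    """Replace the first exact occurrence of `search` with `replace`."""
    if search not in source:
        raise ValueError("search block not found in source")
    return source.replace(search, replace, 1)

original = "def greet():\n    print('hello')\n"
edited = apply_search_replace(
    original,
    search="    print('hello')",
    replace="    print('hello, world')",
)
print(edited)
```

Because each edit only transmits the changed region plus its replacement, the Junior model emits far fewer tokens than a "whole" format model that must reprint every edited file.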

## Try it

Aider has built-in defaults to support Senior/Junior coding with
OpenAI's o1 models, gpt-4o and Anthropic's Claude 3.5 Sonnet.
Run aider with `--senior` or get started quickly like this:

```
pip install -U aider-chat

# Change directory into a git repo
cd /to/your/git/repo

# Work with Claude 3.5 Sonnet as the Senior and Junior
export ANTHROPIC_API_KEY=your-key-goes-here
aider --sonnet --senior

# Work with OpenAI models, using gpt-4o as the Junior
export OPENAI_API_KEY=your-key-goes-here
aider --4o --senior
aider --o1-mini --senior
aider --o1-preview --senior
```

## Full results

<style>
.shaded td {
@@ -42,8 +117,6 @@ Here's a table containing the benchmark data for different model configurations:
<th>Junior</th>
<th>Edit Format</th>
<th>Pass Rate</th>
<th>Average Time</th>
<th>Total Cost</th>
</tr>
</thead>
<tbody>
@@ -53,16 +126,13 @@ Here's a table containing the benchmark data for different model configurations:
<tr class="{% if group_class == 1 %}shaded{% endif %}">
<td>{{ item.model }}</td>
<td>{{ item.junior_model }}</td>
<td>{{ item.junior_edit_format | default: item.edit_format }}</td>
<td style="text-align: center;">{{ item.junior_edit_format | default: item.edit_format }}</td>
<td style="text-align: right;">{{ item.pass_rate_2 }}%</td>
<td style="text-align: right;">{{ item.seconds_per_case }} sec</td>
<td style="text-align: right;">${{ item.total_cost | round: 2 }}</td>
<!-- <td style="text-align: right;">${{ item.total_cost | round: 2 }}</td> -->
</tr>
{% endfor %}
{% endfor %}
</tbody>
</table>

This table provides a comparison of different model configurations, showing their performance in terms of pass rate, processing time, and cost. The data is grouped by Senior model, and within each group rows are sorted by decreasing Pass Rate.