mirror of https://github.com/Aider-AI/aider.git
synced 2025-05-24 14:25:00 +00:00

Merge branch 'main' into swe-bench
commit fb76895eb1

8 changed files with 2245 additions and 5 deletions

@@ -1,6 +1,10 @@
 # Release history

+### main
+
+- Aider will notice if you paste a URL into the chat, and offer to scrape it.
+
 ### v0.36.0

 - [Aider can now lint your code and fix any errors](https://aider.chat/2024/05/22/linting.html).

_posts/2024-05-22-swe-bench-lite.md (new file, 398 lines)
@@ -0,0 +1,398 @@

---
title: Aider scores 26.3% on SWE Bench Lite
excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
highlight_image: /assets/swe_bench_lite.jpg
draft: true
---

# Aider scores 26.3% on SWE Bench Lite

Aider scored 26.3%
on the
[SWE Bench Lite benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).

[![SWE Bench Lite results](https://aider.chat/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)

## Interactive, not agentic

Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing,
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.

It's also worth noting that aider currently does not use
RAG, vector search, tools or give the LLM access to search the web
or unilaterally execute code.

Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming experience where users can ask for a change
and see the edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasting time and token costs.

## Benchmark methodology

For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that, aider ran as normal, with the following modifications:

- Aider's suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
  Plausibly correct means that aider concluded that it had successfully edited the repo
  without causing syntax errors or breaking any *pre-existing* tests.
- If the solution isn't plausible, the harness launches aider to try again from scratch,
  alternating between using aider with GPT-4o and Opus.
- If no plausible solution is found after six tries, the harness picks the solution
  with the fewest edit/lint/test problems (see the sketch of this retry loop below).
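
To make the retry logic concrete, here is a minimal sketch of such a harness loop. The callable and result fields (`run_aider`, `edited_ok`, `lint_ok`, `tests_ok`, `diff`) are illustrative assumptions, not aider's actual harness API.

```
# A minimal sketch of the retry harness described above; names are assumed.
MODELS = ["gpt-4o", "opus"]  # alternate between the two models

def solve_instance(problem, run_aider, max_attempts=6):
    for attempt in range(max_attempts):
        model = MODELS[attempt % 2]          # GPT-4o on attempt 1, Opus on 2, ...
        result = run_aider(problem, model)   # launch aider again from scratch
        # "Plausible" = edits applied cleanly, no lint errors,
        # and no failing pre-existing tests.
        if result.edited_ok and result.lint_ok and result.tests_ok:
            return result.diff               # accept and move to the next instance
    # No plausible solution after six tries; the harness instead keeps the
    # "best" non-plausible attempt (see the priority order later in this post).
    return None
```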

It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
They could not see or run the held out "acceptance tests" that are used
after benchmarking to see if the
SWE Bench problem was correctly resolved.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

- They could launch aider in their repo with the command below, which
  tells aider they want to accept every suggestion
  and to use pytest to run tests.
- `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
  Aider will pull in the URL's content and then try to solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
  [Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
  so it's always easy to revert AI changes that don't pan out.

Outside a benchmark setting, it's probably
unwise or at least highly inefficient
to let *any* AI agent run unsupervised on your code base.
The reason aider is intended to be used interactively
is so that the user can participate and direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.

## Aider with GPT-4o alone was SOTA

Running the benchmark harness
only using aider with GPT-4o to find plausible solutions
achieved a score of 25.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result reported here,
which used aider with both GPT-4o & Opus.

As noted below, a single attempt using Aider with GPT-4o tied
the current top entry on the leaderboard.

## Aider with GPT-4o & Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found for each
problem.

The table below breaks down the plausible solutions that
were found for the 300 problems.
It also provides details on the 79 that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
  These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found using both models, including one correctly resolved solution on the final, sixth attempt at that problem.

| Attempt | Agent |Number of<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite<br>(resolved/300) |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |

If we break down correct solutions purely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw problems that GPT-4o failed to
find plausible solutions for on its first try.

Aider with GPT-4o was producing higher quality plausible solutions,
with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.

| Agent | Number of<br>plausible<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>plausible<br>which<br>correctly<br>resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 239 | 66 | 27.6% |
| Aider with Opus | 61 | 13 | 21.3% |
| **Total** | **300** | **79** | **26.3%** |

## Repository map, not RAG

The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and providing the LLM with
tools to interactively explore the code base.

Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created through static analysis of the code's
abstract syntax tree and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation.
This is done by performing a graph optimization on the code's call graph.
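
As a rough illustration of that kind of graph optimization (not aider's actual implementation), one could rank files by how central they are in a symbol-reference graph, biasing the ranking toward symbols already mentioned in the chat:

```
# Illustrative only: rank repo files by centrality in a symbol-reference
# graph, biased toward symbols mentioned in the chat. Aider's real repo map
# logic is more involved than this sketch.
import networkx as nx

def rank_files(references, chat_symbols):
    """references: iterable of (referencing_file, defining_file, symbol) tuples."""
    graph = nx.DiGraph()
    for ref_file, def_file, symbol in references:
        if graph.has_edge(ref_file, def_file):
            graph[ref_file][def_file]["weight"] += 1
        else:
            graph.add_edge(ref_file, def_file, weight=1)

    # Bias the ranking toward files that define symbols the chat mentions.
    personalization = {
        def_file: 1.0
        for _, def_file, symbol in references
        if symbol in chat_symbols
    }
    ranks = nx.pagerank(graph, weight="weight",
                        personalization=personalization or None)
    return sorted(ranks, key=ranks.get, reverse=True)
```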

When the user asks for a change to their code, the LLM can use the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.

<div class="chat-transcript" markdown="1">

#### Please add a new /factorial/N endpoint.

To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
Please add app.py to the chat so I can proceed with the changes.

> app.py
> Add these files to the chat? yes

</div>

This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench problems.
Aider successfully identified the correct file to edit
in 70.3% of the benchmark tasks.

We can determine which file needed to be edited using the "gold" patch
which is associated with each SWE Bench task.
This patch was created by a human developer
to solve the issue, and therefore reveals a file which can
be edited to solve the problem.
Of course aider is not able to see or use the gold patch
or the file names it contains in any way.
This information was only used to compute
statistics outside the benchmarking process.

## Reliable code editing

Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.

Aider goes to great lengths to ensure that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that aider can
properly integrate code from LLMs into an existing code base and source files.
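
For a sense of what an editing backend does, here is a minimal sketch of one common approach: the LLM emits a search/replace style edit and the tool applies it only if the search text matches the file exactly. This is a simplified illustration under that assumption, not aider's actual edit format handling.

```
# Simplified illustration of a search/replace style editing backend.
# Aider's real edit formats and error recovery are more robust than this.
def apply_search_replace(source: str, search: str, replace: str) -> str:
    if search not in source:
        # Reported back as an edit failure, rather than silently guessing.
        raise ValueError("edit failed: SEARCH block not found in file")
    return source.replace(search, replace, 1)  # apply only the first match
```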

The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
This helps ensure that the project's existing APIs and conventions are
respected and utilized when new code is added.

Regardless, there are still cases where aider may be unable to cleanly
complete the edits specified by the LLM.
This is usually because the LLM has failed to conform to the editing
instructions in its system prompt.
When aider completes, it returns an editing outcome that indicates
whether it was able to successfully complete all edits.
The benchmark harness uses this editing status as
one criterion to determine if aider has
created a plausible solution.

## Linting and fixing

Another key criterion for a plausible solution is that it passes basic
linting, which means that the code is valid and without syntax
or other fatal errors.
[Aider lints code](https://aider.chat/2024/05/22/linting.html)
after every LLM edit and offers to automatically fix
any problems.

Aider ships with built-in linters based on tree-sitter
which work with most popular programming languages.
Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree to display relevant code context for each
error.
This context helps LLMs understand the problem and
make the correct changes to resolve it.

<div class="chat-transcript" markdown="1">

```
app.py:23:36: F821 undefined name 'num'

app.py:
...⋮...
  6│class LongNum:
...⋮...
 19│    def expound(self, threshold):
 20│        number = self.basis
 21│        while number < threshold:
 22│            number *= self.factor
 23█        return num
 24│
 25│
...⋮...
```

> Attempt to fix lint errors? yes

</div>

In the benchmark, these linting suggestions are always accepted.
At completion,
aider reports a linting outcome that
indicates if it was able to produce
code without any outstanding linting errors.
The benchmark harness uses this status as
one of the criteria to determine if aider has
created a plausible solution.
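
As a minimal illustration of what "basic linting" means here, the check below flags fatal Python syntax errors in a file; it is only a stand-in for aider's tree-sitter based linters, which cover many languages and produce the richer context shown above.

```
# Minimal stand-in for a basic lint pass: flag fatal Python syntax errors.
# Aider's real linters are tree-sitter based and language-aware.
def basic_lint(path):
    with open(path) as f:
        source = f.read()
    try:
        compile(source, path, "exec")  # parse/compile only; nothing is executed
    except SyntaxError as err:
        return f"{path}:{err.lineno}:{err.offset}: {err.msg}"
    return None  # no fatal errors found
```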

## Testing and fixing

The final criterion for a plausible solution is that
all tests must be passing.
Aider can be configured with the command to run tests for a repo,
and will automatically attempt to fix any test failures.

A user working on a python project might configure testing
by launching aider like this:

```
aider --test-cmd pytest
```

For the benchmark, aider is configured with a test command that will run the
tests that already exist in each problem's repository.
SWE Bench problems are based on repositories from large open
source projects with extensive existing test suites.
This means that
testing will fail if aider has broken any of these
pre-existing tests or if any new
tests that it created aren't passing.
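
Conceptually, the test step boils down to running the configured command and capturing its output so failures can be shown to the LLM to fix. A rough sketch, not aider's internal code:

```
# Rough sketch of the test step: run the configured test command and, on
# failure, keep the output so it can be fed back to the LLM.
import subprocess

def run_test_cmd(cmd="pytest"):
    proc = subprocess.run(cmd.split(), capture_output=True, text=True)
    passed = proc.returncode == 0
    return passed, proc.stdout + proc.stderr
```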

As with editing and linting, aider reports a testing outcome
that indicates if it completed with any outstanding failing tests.
The benchmark harness uses this status when deciding if aider
has produced a plausible solution.

To be clear, *aider cannot run or even see the held out "acceptance tests"* that
are used to determine if a proposed solution correctly
resolves the problem.
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.

## Finding a plausible solution

Each time aider executes, it reports
the outcome of the editing, linting, and testing
steps.
Each of these steps may complete successfully or
return a status that indicates that there were outstanding
problems that remain unresolved.

The benchmark harness uses these outcomes to determine if
aider has produced a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint, or test errors.
In this case, aider's changes are recorded
as the SWE Bench `model_patch` to be evaluated later with the
acceptance tests.

If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
The harness alternates launching aider with GPT-4o and Opus to solve the problem,
and gives each model three attempts -- for a total of six attempts.
As soon as a plausible solution is found, it is accepted and the
harness moves on to the next SWE Bench instance.

It's worth noting that repositories may have lint or test errors
present before aider even starts to edit them.
Whether unresolved errors were caused by aider or were pre-existing,
there will be instances where
no plausible solution is
found after six tries.

If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
`model_patch`.
Which of the non-plausible solutions to use is determined
by ignoring the testing outcome
and prioritizing solutions in the following order
(one way to encode this ranking is sketched after the list):

- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.
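
Expressed as code, that priority order might look like the ranking below; the attempt fields (`edit_ok`, `edit_partial`, `lint_ok`) are assumed names for illustration, not the harness's actual data model.

```
# Sketch of the fallback ranking described above; lower rank wins.
# Field names are assumptions, and test results are ignored here.
def fallback_rank(attempt):
    if attempt.edit_ok and attempt.lint_ok:
        return 0  # edits applied cleanly and lint is clean
    if attempt.edit_partial and attempt.lint_ok:
        return 1  # some edits applied, lint is clean
    if attempt.edit_ok:
        return 2  # edits applied cleanly, lint problems remain
    if attempt.edit_partial:
        return 3  # some edits applied
    return 4      # nothing usable was produced

def pick_best_non_plausible(attempts):
    return min(attempts, key=fallback_rank)
```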

## Computing the benchmark score

The benchmark harness produces a candidate solution for each of the 300
SWE Bench Lite instances and saves it as the `model_patch`.

A separate evaluation script
tests each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
are discarded.
This ensures that the full, correct test suite is used for acceptance testing.
The evaluation script compares the test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they match, the candidate solution has correctly resolved the issue.
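
In essence, the acceptance check asks whether the tests that pass with the gold patch applied also pass with the candidate patch applied. A simplified sketch of that comparison; the real SWE Bench evaluation tracks specific sets of tests per instance:

```
# Simplified sketch of the acceptance comparison described above.
# Each results dict maps test id -> True (passed) / False (failed).
def correctly_resolved(candidate_results, gold_results):
    must_pass = {test for test, ok in gold_results.items() if ok}
    return all(candidate_results.get(test, False) for test in must_pass)
```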

These acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used, or even visible during aider's attempts to solve the problems.

Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.

## Acknowledgments

Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](SWE-bench-docker),
making it faster, easier, and more reliable to run the acceptance tests.

@@ -618,6 +618,20 @@ class Coder:
             return self.commands.run(inp)

         self.check_for_file_mentions(inp)
+        inp = self.check_for_urls(inp)
+
+        return inp
+
+    def check_for_urls(self, inp):
+        url_pattern = re.compile(
+            r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
+        )
+        urls = url_pattern.findall(inp)
+        for url in urls:
+            if self.io.confirm_ask(f"Add {url} to the chat?"):
+                inp += "\n\n"
+                inp += self.commands.cmd_web(url)
+
         return inp

     def keyboard_interrupt(self):
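
For illustration, the URL pattern added above picks HTTP(S) URLs out of a free-form chat message. This standalone snippet (independent of the `Coder` class, with a made-up example message) shows the behavior:

```
# Standalone illustration of the URL detection added above.
import re

url_pattern = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

inp = "Please fix the bug described in https://example.com/project/issues/123 today"
print(url_pattern.findall(inp))
# -> ['https://example.com/project/issues/123']
```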

@@ -69,8 +69,8 @@ class Commands:
             self.scraper = Scraper(print_error=self.io.tool_error)

         content = self.scraper.scrape(url) or ""
-        if content:
-            self.io.tool_output(content)
+        # if content:
+        #     self.io.tool_output(content)

         instructions = self.scraper.get_playwright_instructions()
         if instructions:

assets/swe_bench_lite.jpg (new binary file, 36 KiB; not shown)

assets/swe_bench_lite.svg (new file, 1750 lines, 43 KiB; diff suppressed because it is too large)

@@ -22,6 +22,7 @@ def plot_over_time(yaml_file):
     plt.rcParams["hatch.color"] = "#444444"

     rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
+    plt.rcParams["text.color"] = "#444444"

     fig, ax = plt.subplots(figsize=(10, 5))
     ax.grid(axis="y", zorder=0, lw=0.2)

@@ -44,10 +45,12 @@ def plot_over_time(yaml_file):
             textcoords="offset points",
         )

-    ax.set_xlabel("Model release date", fontsize=18)
-    ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18)
+    ax.set_xlabel("Model release date", fontsize=18, color="#555")
+    ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18, color="#555")
     ax.set_title("LLM code editing skill by model release date", fontsize=20)
-    plt.tight_layout()
+    ax.set_ylim(0, 30)
+    plt.xticks(fontsize=14)
+    plt.tight_layout(pad=3.0)
     plt.savefig("tmp_over_time.png")
     plt.savefig("tmp_over_time.svg")
     imgcat(fig)

benchmark/swe_bench_lite.py (new file, 71 lines)
@@ -0,0 +1,71 @@
import matplotlib.pyplot as plt
from imgcat import imgcat
from matplotlib import rc


def plot_swe_bench_lite(data_file):
    with open(data_file, "r") as file:
        lines = file.readlines()

    models = []
    pass_rates = []

    # Each data line looks like "<pass rate>% <model label>", where "|" in the
    # label marks a line break for the x-axis tick text.
    for line in lines:
        if line.strip():
            pass_rate, model = line.split("%")
            model = model.strip()
            model = model.replace("|", "\n")
            models.insert(0, model.strip())
            pass_rates.insert(0, float(pass_rate.strip()))

    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"

    font_color = "#555"
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
    plt.rcParams["text.color"] = font_color

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.grid(axis="y", zorder=0, lw=0.2)
    for spine in ax.spines.values():
        spine.set_edgecolor("#DDDDDD")
        spine.set_linewidth(0.5)

    # Highlight the Aider bars; other leaderboard entries get muted colors.
    colors = ["#17965A" if "Aider" in model else "#b3d1e6" for model in models]
    bars = []
    for model, pass_rate, color in zip(models, pass_rates, colors):
        alpha = 0.6 if "Aider" in model else 0.3
        bar = ax.bar(model, pass_rate, color=color, alpha=alpha, zorder=3)
        bars.append(bar[0])

    for model, bar in zip(models, bars):
        yval = bar.get_height()
        y = yval + 0.75 if "Aider" in model else yval - 1.25
        va = "bottom" if "Aider" in model else "top"

        ax.text(
            bar.get_x() + bar.get_width() / 2,
            y,
            f"{yval}%",
            ha="center",
            va=va,
            fontsize=14,
        )

    # ax.set_xlabel("Models", fontsize=18)
    ax.set_ylabel("Instances resolved (%)", fontsize=18, color=font_color)
    ax.set_title("SWE Bench Lite", fontsize=20)
    ax.set_ylim(0, 29.9)
    plt.xticks(
        fontsize=16,
        color=font_color,
    )
    plt.tight_layout(pad=3.0)
    plt.savefig("swe_bench_lite.jpg")
    plt.savefig("swe_bench_lite.svg")
    imgcat(fig)

    ax.xaxis.label.set_color(font_color)


# Example usage
plot_swe_bench_lite("benchmark/tmp.txt")