diff --git a/HISTORY.md b/HISTORY.md index ea125c884..7aab4df7c 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -1,6 +1,10 @@ # Release history +### main + +- Aider will notice if you paste a URL into the chat, and offer to scrape it. + ### v0.36.0 - [Aider can now lint your code and fix any errors](https://aider.chat/2024/05/22/linting.html). diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md new file mode 100644 index 000000000..638ee4a37 --- /dev/null +++ b/_posts/2024-05-22-swe-bench-lite.md @@ -0,0 +1,398 @@ +--- +title: Aider scores 26.3% on SWE Bench Lite +excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result. +highlight_image: /assets/swe_bench_lite.jpg +draft: true +--- + +# Aider scores 26.3% on SWE Bench Lite + +Aider scored 26.3% +on the +[SWE Bench Lite benchmark](https://www.swebench.com), +achieving a state-of-the-art result. +The current top leaderboard entry is 20.3% +from Amazon Q Developer Agent. +The best result reported elsewhere seems to be +[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover). + +[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg) + +## Interactive, not agentic + +Aider achieved this result mainly through its focus on static code analysis, +reliable LLM code editing, +and pragmatic workflows for interactive pair programming with AI. +Aider intentionally has quite limited and narrow "agentic behavior" +to avoid long delays, high token costs +and the need for users to repeatedly code review incorrect solutions. +It's also worth noting that aider currently does not use +RAG, vector search, tools or give the LLM access to search the web +or unilaterally execute code. + +Aider is first and foremost an interactive tool for engineers to get real work done in +real code bases using a chat interface. +Aider provides a pair programming experience where users can ask for a change +and see the edits performed in real-time. +Aider can also offer additional help like fixing lint or test errors, +but the user is always in full interactive control. +This lets them quickly steer misunderstandings back on course and +avoid wasting time and token costs. + + +## Benchmark methodology + +For the benchmark, +aider was launched in each problem's git repository +with the problem statement +submitted as the opening chat message from "the user." +After that aider runs as normal, with the following modifications: + +- Aider's suggestions were always accepted without user approval. +- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*. +Plausibly correct means that aider concluded that it had successfully edited the repo +without causing syntax errors or breaking any *pre-existing* tests. +- If the solution isn't plausible, the harness launches aider to try again from scratch, +alternating between using aider with GPT-4o and Opus. +- If no plausible solution is found after six tries, the harness picks the solution +with the least amount of edit/lint/test problems. + +It's important to be clear that +*aider and the benchmark harness +only had access to the pre-existing tests in each problem's repo*. +They could not see or run the held out "acceptance tests" that are used +after benchmarking to see if the +SWE Bench problem was correctly resolved. 
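
In outline, the harness loop looks roughly like the sketch below. The `Attempt` record and the `run_aider` callable are hypothetical placeholders for launching aider on the repo and collecting its edit/lint/test outcomes; they are illustrative, not the actual benchmark code.

```
from dataclasses import dataclass


@dataclass
class Attempt:
    """Outcome of one aider run (illustrative fields, not aider's real API)."""
    model: str
    plausible: bool     # edited the repo without syntax errors or new test failures
    problem_count: int  # outstanding edit/lint/test problems, used as a tiebreaker


def solve_instance(run_aider, max_attempts: int = 6) -> Attempt:
    """Retry a SWE Bench problem, alternating models, until a plausible fix appears."""
    models = ["gpt-4o", "claude-3-opus"]
    attempts = []
    for i in range(max_attempts):
        attempt = run_aider(model=models[i % len(models)])  # GPT-4o, Opus, GPT-4o, ...
        attempts.append(attempt)
        if attempt.plausible:
            return attempt
    # No plausible solution after six tries: keep the least-problematic attempt.
    return min(attempts, key=lambda a: a.problem_count)
```
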
+ +The benchmarking process was similar to how a developer might use aider to +resolve a GitHub issue: + +- They could launch aider in their repo with the command below, which +tells aider they want to accept every suggestion +and to use pytest to run tests. + - `aider --yes --test-cmd pytest` +- They could start the chat by pasting in the URL or text of a GitHub issue. +Aider will pull in the URL's content and then try and solve the issue. +- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time. +[Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git), +so it's always easy to revert AI changes that don't pan out. + +Outside a benchmark setting, it's probably +unwise or at least highly inefficient +to let *any* AI agent run unsupervised on your code base. +The reason aider is intended to be used interactively +is so that the user can participate and direct aider's work and approve suggestions. +This way the user can offer immediate feedback or corrections if their initial +instructions turn out to be ambiguous, +or if the AI starts going down a wrong path. + +## Aider with GPT-4o alone was SOTA + +Running the benchmark harness +only using aider with GPT-4o to find plausible solutions +achieved a score of 25.0%. +This was itself a state-of-the-art result, before being surpassed by the main +result being reported here +that used aider with both GPT-4o & Opus. + +As noted below, a single attempt using Aider with GPT-4o tied +the current top entry on the leaderboard. + +## Aider with GPT-4o & Opus + +The benchmark harness alternated between running aider with GPT-4o and Opus. +The harness proceeded in a fixed order, always starting with GPT-4o and +then alternating with Opus until a plausible solution was found for each +problem. + +The table below breaks down the plausible solutions that +were found for the 300 problems. +It also provides details on the 79 that were ultimately +verified as correctly resolving their issue. +Some noteworthy observations: + +- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard. +- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results. +These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions. +- A long tail of solutions continued to be found using both models including one correctly resolved solution on the final, sixth attempt of that problem. + + +| Attempt | Agent |Number of
plausible<br>solutions|Percent of<br>plausible<br>solutions| Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite<br>
(resolved/300) | +|:--------:|------------|---------:|---------:|----:|---:|--:| +| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% | +| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% | +| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% | +| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% | +| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% | +| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% | +| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** | + + +If we break down correct solutions purely by model, +we can see that aider with GPT-4o outperforms Opus. +This isn't a fair and direct comparison, because GPT-4o always took the first +turn and therefore got first crack at all the "easiest" problems. +Aider with Opus only ever saw problems that GPT-4o failed to +find plausible solutions for on its first try. + +Aider with GPT-4o was producing higher quality plausible solutions, +with a greater chance of going on to be accepted as resolving the issue. +Again, this is biased by the turn ordering. +But other anecdotal evidence from earlier runs of the benchmark +also supports the observation that aider with GPT-4o is significantly stronger than Opus +for this benchmark. + + +| Agent | Number of
plausible<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>plausible<br>which<br>correctly<br>resolved<br>
| +|------------|---------:|---------:|---:| +| Aider with GPT-4o | 239 | 66 |27.6% | +| Aider with Opus | 61 | 13 |21.3% | +| **Total** | **300** | **79** |**26.3%** | + +## Repository map, not RAG + +The crucial first step in solving a SWE Bench problem is figuring out +which parts of the repo are relevant and which files need to be edited. +Most coding agents use some combination of RAG, vector search +and providing the LLM with +tools to interactively explore the code base. + +Aider instead uses a +[repository map](https://aider.chat/2023/10/22/repomap.html) +to help the LLM understand the +layout, code structure, and content of a git repo. +The repo map is created through static analysis of the code's +abstract syntax tree and call graph +to provide a compact and powerful summary of the entire code base. +The map is constantly +tailored to show +repo context that is relevant to the current state of the chat conversation. +This is done by performing a graph optimization on the code's call graph. + +When the user asks for a change to their code, the LLM can use the repo map +to decide which files to edit. +The LLM simply returns a normal text response explaining which files +it needs to edit and why. +Aider notices when the LLM mentions filenames from the repo, +and asks the user if they should be added to the chat. +Adding a file to the chat allows the LLM to see the full contents +of the file and edit it. + +
+ +#### Please add a new /factorial/N endpoint. + +To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py. +Please add app.py to the chat so I can proceed with the changes. + +> app.py +> Add these files to the chat? yes + +
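
Under the hood, this workflow hinges on noticing when the LLM's reply mentions files that exist in the repo. Below is a simplified sketch of that kind of matching; it is purely illustrative and not aider's actual implementation.

```
import re
from pathlib import Path


def mentioned_repo_files(llm_reply: str, repo_root: str) -> set[str]:
    """Return repo-relative paths that the LLM's reply appears to mention."""
    repo_files = {
        str(path.relative_to(repo_root))
        for path in Path(repo_root).rglob("*")
        if path.is_file()
    }
    # Split the reply on whitespace, quotes and backticks to get candidate tokens.
    tokens = set(re.split(r"[\s`'\"]+", llm_reply))
    return {
        rel for rel in repo_files
        if rel in tokens or Path(rel).name in tokens
    }
```

In the transcript above, a match of this sort is what triggers the offer to add `app.py` to the chat.
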
+ +This is a convenient and natural workflow for interactive chat, +and it worked well for the SWE Bench problems. +Aider successfully identified the correct file to edit +in 70.3% of the benchmark tasks. + +We can determine which file needed to be edited using the "gold" patch +which is associated with each SWE Bench task. +This patch was created by a human developer +to solve the issue, and therefore reveals a file which can +be edited to solve the problem. +Of course aider is not able to see or use the gold patch +or the file names it contains in any way. +This information was only used to compute +statistics outside the benchmarking process. + + +## Reliable code editing + +Once files have been selected for editing, +the next step is of course to edit the source code to fix the problem. + +Aider goes to great lengths to ensure that LLMs can not just write code, +but reliably *edit* code. +Aider has a collection of prompting strategies and code editing backends which have +been honed through +[extensive benchmarking](https://aider.chat/docs/leaderboards/). +These foundational capabilities help ensure that aider can +properly integrate code from LLMs into an existing code base and source files. + +The repository map helps here too, making sure that the LLM +can see relevant classes, functions and variables from the entire repo. +This helps ensure that the project's existing APIs and conventions are +respected and utilized when new code is added. + +Regardless, there are still cases where aider may be unable to cleanly +complete the edits specified by the LLM. +This is usually because the LLM has failed to conform to the editing +instructions in its system prompt. +When aider completes, it returns an editing outcome that indicates +whether it was able to successfully complete all edits. +The benchmark harness uses this editing status as +one criteria to determine if aider has +created a plausible solution. + +## Linting and fixing + +Another key criteria for a plausible solution is that it passes basic +linting, which means that the code is valid and without syntax +or other fatal errors. +[Aider lints code](https://aider.chat/2024/05/22/linting.html) +after every LLM edit and offers to automatically fix +any problems. + +Aider ships with built-in linters based on tree-sitter +which work with most popular programming languages. +Aider shows linting errors to the LLM in a novel format, +using the abstract syntax tree to display relevant code context for each +error. +This context helps LLMs understand the problem and +make the correct changes to resolve it. + +
+ +``` +app.py:23:36: F821 undefined name 'num' + +app.py: +...⋮... + 6│class LongNum: +...⋮... + 19│ def expound(self, threshold): + 20│ number = self.basis + 21│ while number < threshold: + 22│ number *= self.factor + 23█ return num + 24│ + 25│ +...⋮... +``` + +> Attempt to fix lint errors? yes + +
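
The feedback above comes from aider's built-in tree-sitter based linters. As a rough illustration of the general idea (not aider's implementation), a harness could shell out to a conventional linter such as flake8 and forward only fatal problems, like syntax errors and undefined names, back to the LLM:

```
import subprocess


def fatal_lint_errors(path: str) -> str | None:
    """Run flake8 on one file and return fatal findings, or None if it is clean.

    E9 covers syntax/runtime errors and F821 covers undefined names; limiting
    the selection this way is an illustrative choice, not aider's exact config.
    """
    result = subprocess.run(
        ["flake8", "--select=E9,F821", path],
        capture_output=True,
        text=True,
    )
    output = (result.stdout + result.stderr).strip()
    return output or None
```

Output in this `file:line:col: code message` form matches the first line of the example above; aider then adds the surrounding code context drawn from the syntax tree.
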
+ +In the benchmark, these linting suggestions are always accepted. +At completion, +aider reports a linting outcome that +indicates if it was able to produce +code without any outstanding linting errors. +The benchmark harness uses this status as +one of the criteria to determine if aider has +created a plausible solution. + +## Testing and fixing + +The final crtieria for a plausible solution is that +all tests must be passing. +Aider can be configured with the command to run tests for a repo, +and will automatically attempt to fix any test failures. + +A user working on a python project might configure testing +by launching aider like this: + +``` +aider --test-cmd pytest +``` + +For the benchmark, aider is configured with a test command that will run the +tests that already exist in each problem's repository. +SWE Bench problems are based on repositories from large open +source projects with extensive existing test suites. +This means that +testing will fail if aider has broken any of these +pre-existing tests or if any new +tests that it created aren't passing. + +As with editing and linting, aider reports a testing outcome +that indicates if it completed with any outstanding failing tests. +The benchmark harness uses this status when deciding if aider +has produced a plausible solution. + +To be clear, *aider cannot run or even see the held out "acceptance tests"* that +are used to determine if a proposed solution correctly +resolves the problem. +Those tests are only run outside of aider and the benchmark harness, +to compute the final benchmark score. + +## Finding a plausible solution + +Each time aider executes, it reports +the outcome of the editing, linting, and testing +steps. +Each of these steps may complete successfully or +return a status that indicates that there were outstanding +problems that remain unresolved. + +The benchmark harness uses these outcomes to determine if +aider has produced a plausible +solution to the current SWE Bench task. +A plausible solution is one where aider +returns saying that it +edited the repo with no outstanding +edit, lint, or test errors. +In this case, aider's changes are recorded +as the SWE Bench `model_patch` to be evaluated later with the +acceptance tests. + +If the solution is not plausible, another +instance of aider is launched again from scratch on the same problem. +The harness alternates launching aider with GPT-4o and Opus to solve the problem, +and gives each model three attempts -- for a total of six attempts. +As soon as a plausible solution is found, it is accepted and the +harness moves on to the next SWE Bench instance. + +It's worth noting that repositories may have lint or test errors +present before aider even starts to edit them. +Whether unresolved errors were caused by aider or were pre-existing, +there will be instances where +no plausible solution is +found after six tries. + +If all six attempts fail to produce a plausible solution, +then the "best" solution available is selected as the +`model_patch`. +Which of the non-plausible solutions to use is determined +by ignoring the testing outcome +and prioritizing solutions in the following order: + + - Pick a solution where editing and linting were completed successfully. + - Pick a solution where editing was at least partially successful and linting succeeded. + - Pick a solution where editing was successful. + - Pick a solution where editing was at least partially successful. 
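
Expressed as code, that fallback ranking might look like the sketch below. The `edit_outcome` and `lint_clean` fields are hypothetical stand-ins for the outcomes aider reports; test results are deliberately ignored, as described above.

```
def pick_fallback_patch(attempts: list[dict]) -> dict:
    """Choose the "best" non-plausible attempt using the priority order above."""

    def rank(attempt: dict) -> int:
        edited_fully = attempt["edit_outcome"] == "success"
        edited_partly = attempt["edit_outcome"] in ("success", "partial")
        if edited_fully and attempt["lint_clean"]:
            return 0  # editing and linting both completed successfully
        if edited_partly and attempt["lint_clean"]:
            return 1  # editing at least partially successful, linting succeeded
        if edited_fully:
            return 2  # editing successful
        if edited_partly:
            return 3  # editing at least partially successful
        return 4      # nothing usable; last resort

    return min(attempts, key=rank)
```
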
+ +## Computing the benchmark score + +The benchmark harness produces a candidate solution for each of the 300 +SWE Bench Lite instances and saves it as the `model_patch`. + +A separate evaluation script +tests each of these solutions with the full test suite +including the held out acceptance tests. +For this final acceptance testing, any edits that aider made to tests +are discarded. +This ensures that the full, correct test suite is used for acceptance testing. +The evaluation script compares the test results +with results from testing +the "gold" patch that was developed by a human to correctly solve the issue. +If they match, the candidate solution has correctly resolved the issue. + +These acceptance tests are only ever run outside of aider +and the benchmark harness, and only to compute the number of +correctly resolved instances. +They are never run, used, or even visible during aider's attempts to solve the problems. + +Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%. + +## Acknowledgments + +Much thanks to the team behind the +[SWE Bench](https://www.swebench.com) +family of AI coding benchmarks. +Also thanks to Albert Örwall who has +[dockerized the SWE Bench evaluation scripts](SWE-bench-docker) +making it faster, easier, and more reliable to run the acceptance tests. + + diff --git a/aider/coders/base_coder.py b/aider/coders/base_coder.py index 6e7d87bf9..93442c2ea 100755 --- a/aider/coders/base_coder.py +++ b/aider/coders/base_coder.py @@ -618,6 +618,20 @@ class Coder: return self.commands.run(inp) self.check_for_file_mentions(inp) + inp = self.check_for_urls(inp) + + return inp + + def check_for_urls(self, inp): + url_pattern = re.compile( + r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" + ) + urls = url_pattern.findall(inp) + for url in urls: + if self.io.confirm_ask(f"Add {url} to the chat?"): + inp += "\n\n" + inp += self.commands.cmd_web(url) + return inp def keyboard_interrupt(self): diff --git a/aider/commands.py b/aider/commands.py index 8e8f830d3..cf52c1d18 100644 --- a/aider/commands.py +++ b/aider/commands.py @@ -69,8 +69,8 @@ class Commands: self.scraper = Scraper(print_error=self.io.tool_error) content = self.scraper.scrape(url) or "" - if content: - self.io.tool_output(content) + # if content: + # self.io.tool_output(content) instructions = self.scraper.get_playwright_instructions() if instructions: diff --git a/assets/swe_bench_lite.jpg b/assets/swe_bench_lite.jpg new file mode 100644 index 000000000..40adf98fd Binary files /dev/null and b/assets/swe_bench_lite.jpg differ diff --git a/assets/swe_bench_lite.svg b/assets/swe_bench_lite.svg new file mode 100644 index 000000000..0e67796c9 --- /dev/null +++ b/assets/swe_bench_lite.svg @@ -0,0 +1,1750 @@ + + + + + + + + 2024-05-23T13:12:59.895266 + image/svg+xml + + + Matplotlib v3.9.0, https://matplotlib.org/ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/benchmark/over_time.py b/benchmark/over_time.py index 33e80e67e..0ea641d64 100644 --- a/benchmark/over_time.py +++ b/benchmark/over_time.py @@ -22,6 +22,7 @@ def plot_over_time(yaml_file): plt.rcParams["hatch.color"] = "#444444" rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10}) + plt.rcParams["text.color"] = "#444444" fig, ax = plt.subplots(figsize=(10, 5)) ax.grid(axis="y", zorder=0, lw=0.2) @@ -44,10 +45,12 @@ def plot_over_time(yaml_file): textcoords="offset points", ) - ax.set_xlabel("Model release date", fontsize=18) - ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18) + ax.set_xlabel("Model release date", fontsize=18, color="#555") + ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18, color="#555") ax.set_title("LLM code editing skill by model release date", fontsize=20) - plt.tight_layout() + ax.set_ylim(0, 30) + plt.xticks(fontsize=14) + plt.tight_layout(pad=3.0) plt.savefig("tmp_over_time.png") plt.savefig("tmp_over_time.svg") imgcat(fig) diff --git a/benchmark/swe_bench_lite.py b/benchmark/swe_bench_lite.py new file mode 100644 index 000000000..72106f197 --- /dev/null +++ b/benchmark/swe_bench_lite.py @@ -0,0 +1,71 @@ +import matplotlib.pyplot as plt +from imgcat import imgcat +from matplotlib import rc + + +def plot_swe_bench_lite(data_file): + with open(data_file, "r") as file: + lines = file.readlines() + + models = [] + pass_rates = [] + + for line in lines: + if line.strip(): + pass_rate, model = line.split("%") + model = model.strip() + model = model.replace("|", "\n") + models.insert(0, model.strip()) + pass_rates.insert(0, float(pass_rate.strip())) + + plt.rcParams["hatch.linewidth"] = 0.5 + plt.rcParams["hatch.color"] = "#444444" + + font_color = "#555" + rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10}) + plt.rcParams["text.color"] = font_color + + fig, ax = plt.subplots(figsize=(10, 5)) + ax.grid(axis="y", zorder=0, lw=0.2) + for spine in ax.spines.values(): + spine.set_edgecolor("#DDDDDD") + spine.set_linewidth(0.5) + + colors = ["#17965A" if "Aider" in model else "#b3d1e6" for model in models] + bars = [] + for model, pass_rate, color in zip(models, pass_rates, colors): + alpha = 0.6 if "Aider" in model else 0.3 + bar = ax.bar(model, pass_rate, color=color, alpha=alpha, zorder=3) + bars.append(bar[0]) + + for model, bar in zip(models, bars): + yval = bar.get_height() + y = yval + 0.75 if "Aider" in model else yval - 1.25 + va = "bottom" if "Aider" in model else "top" + + ax.text( + bar.get_x() + bar.get_width() / 2, + y, + f"{yval}%", + ha="center", + va=va, + fontsize=14, + ) + + # ax.set_xlabel("Models", fontsize=18) + ax.set_ylabel("Instances resolved (%)", fontsize=18, color=font_color) + ax.set_title("SWE Bench Lite", fontsize=20) + ax.set_ylim(0, 29.9) + plt.xticks( + fontsize=16, + color=font_color, + ) + plt.tight_layout(pad=3.0) + plt.savefig("swe_bench_lite.jpg") + 
plt.savefig("swe_bench_lite.svg") + imgcat(fig) + ax.xaxis.label.set_color(font_color) + + +# Example usage +plot_swe_bench_lite("benchmark/tmp.txt")