mirror of https://github.com/Aider-AI/aider.git
synced 2025-05-24 14:25:00 +00:00

Merge branch 'main' into swe-bench
commit fb76895eb1

8 changed files with 2245 additions and 5 deletions

@@ -1,6 +1,10 @@
 # Release history

+### main
+
+- Aider will notice if you paste a URL into the chat, and offer to scrape it.
+
 ### v0.36.0

 - [Aider can now lint your code and fix any errors](https://aider.chat/2024/05/22/linting.html).

_posts/2024-05-22-swe-bench-lite.md (new file, 398 lines)
@@ -0,0 +1,398 @@

---
title: Aider scores 26.3% on SWE Bench Lite
excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
highlight_image: /assets/swe_bench_lite.jpg
draft: true
---

# Aider scores 26.3% on SWE Bench Lite

Aider scored 26.3%
on the
[SWE Bench Lite benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).

[![SWE Bench Lite results](https://aider.chat/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)

## Interactive, not agentic

Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing,
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior"
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.

It's also worth noting that aider currently does not use
RAG, vector search, tools or give the LLM access to search the web
or unilaterally execute code.

Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming experience where users can ask for a change
and see the edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This lets them quickly steer misunderstandings back on course and
avoid wasting time and token costs.

## Benchmark methodology

For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that, aider ran as normal, with the following modifications:

- Aider's suggestions were always accepted without user approval.
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
  Plausibly correct means that aider concluded that it had successfully edited the repo
  without causing syntax errors or breaking any *pre-existing* tests.
- If the solution isn't plausible, the harness launches aider to try again from scratch,
  alternating between using aider with GPT-4o and Opus.
- If no plausible solution is found after six tries, the harness picks the solution
  with the fewest edit/lint/test problems (see the sketch of this retry loop below).
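
To make the retry logic concrete, here is a minimal sketch of such a harness loop. The callable and result fields (`run_aider`, `edited_ok`, `lint_ok`, `tests_ok`, `diff`) are illustrative assumptions, not aider's actual harness API.

```
# A minimal sketch of the retry harness described above; names are assumed.
MODELS = ["gpt-4o", "opus"]  # alternate between the two models

def solve_instance(problem, run_aider, max_attempts=6):
    for attempt in range(max_attempts):
        model = MODELS[attempt % 2]          # GPT-4o on attempt 1, Opus on 2, ...
        result = run_aider(problem, model)   # launch aider again from scratch
        # "Plausible" = edits applied cleanly, no lint errors,
        # and no failing pre-existing tests.
        if result.edited_ok and result.lint_ok and result.tests_ok:
            return result.diff               # accept and move to the next instance
    # No plausible solution after six tries; the harness instead keeps the
    # "best" non-plausible attempt (see the priority order later in this post).
    return None
```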

It's important to be clear that
*aider and the benchmark harness
only had access to the pre-existing tests in each problem's repo*.
They could not see or run the held out "acceptance tests" that are used
after benchmarking to see if the
SWE Bench problem was correctly resolved.

The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:

- They could launch aider in their repo with the command below, which
  tells aider they want to accept every suggestion
  and to use pytest to run tests.
- `aider --yes --test-cmd pytest`
- They could start the chat by pasting in the URL or text of a GitHub issue.
  Aider will pull in the URL's content and then try to solve the issue.
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe using aider with a different LLM this time.
  [Aider is tightly integrated with git](https://aider.chat/docs/faq.html#how-does-aider-use-git),
  so it's always easy to revert AI changes that don't pan out.

Outside a benchmark setting, it's probably
unwise or at least highly inefficient
to let *any* AI agent run unsupervised on your code base.
The reason aider is intended to be used interactively
is so that the user can participate and direct aider's work and approve suggestions.
This way the user can offer immediate feedback or corrections if their initial
instructions turn out to be ambiguous,
or if the AI starts going down a wrong path.

## Aider with GPT-4o alone was SOTA

Running the benchmark harness
only using aider with GPT-4o to find plausible solutions
achieved a score of 25.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result reported here,
which used aider with both GPT-4o & Opus.

As noted below, a single attempt using Aider with GPT-4o tied
the current top entry on the leaderboard.

## Aider with GPT-4o & Opus

The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found for each
problem.

The table below breaks down the plausible solutions that
were found for the 300 problems.
It also provides details on the 79 that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:

- *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
  These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
- A long tail of solutions continued to be found using both models, including one correctly resolved solution on the final, sixth attempt at that problem.

| Attempt | Agent |Number of<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite<br>(resolved/300) |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |

If we break down correct solutions purely by model,
we can see that aider with GPT-4o outperforms Opus.
This isn't a fair and direct comparison, because GPT-4o always took the first
turn and therefore got first crack at all the "easiest" problems.
Aider with Opus only ever saw problems that GPT-4o failed to
find plausible solutions for on its first try.

Aider with GPT-4o was producing higher quality plausible solutions,
with a greater chance of going on to be accepted as resolving the issue.
Again, this is biased by the turn ordering.
But other anecdotal evidence from earlier runs of the benchmark
also supports the observation that aider with GPT-4o is significantly stronger than Opus
for this benchmark.

| Agent | Number of<br>plausible<br>solutions | Number of<br>correctly<br>resolved<br>solutions | Percent of<br>plausible<br>which<br>correctly<br>resolved |
|------------|---------:|---------:|---:|
| Aider with GPT-4o | 239 | 66 | 27.6% |
| Aider with Opus | 61 | 13 | 21.3% |
| **Total** | **300** | **79** | **26.3%** |

## Repository map, not RAG

The crucial first step in solving a SWE Bench problem is figuring out
which parts of the repo are relevant and which files need to be edited.
Most coding agents use some combination of RAG, vector search
and providing the LLM with
tools to interactively explore the code base.

Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created through static analysis of the code's
abstract syntax tree and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation.
This is done by performing a graph optimization on the code's call graph.
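
As a rough illustration of that kind of graph optimization (not aider's actual implementation), one could rank files by how central they are in a symbol-reference graph, biasing the ranking toward symbols already mentioned in the chat:

```
# Illustrative only: rank repo files by centrality in a symbol-reference
# graph, biased toward symbols mentioned in the chat. Aider's real repo map
# logic is more involved than this sketch.
import networkx as nx

def rank_files(references, chat_symbols):
    """references: iterable of (referencing_file, defining_file, symbol) tuples."""
    graph = nx.DiGraph()
    for ref_file, def_file, symbol in references:
        if graph.has_edge(ref_file, def_file):
            graph[ref_file][def_file]["weight"] += 1
        else:
            graph.add_edge(ref_file, def_file, weight=1)

    # Bias the ranking toward files that define symbols the chat mentions.
    personalization = {
        def_file: 1.0
        for _, def_file, symbol in references
        if symbol in chat_symbols
    }
    ranks = nx.pagerank(graph, weight="weight",
                        personalization=personalization or None)
    return sorted(ranks, key=ranks.get, reverse=True)
```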

When the user asks for a change to their code, the LLM can use the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.

<div class="chat-transcript" markdown="1">

#### Please add a new /factorial/N endpoint.

To add a new /factorial/N endpoint, the most likely file that needs to be edited is app.py.
Please add app.py to the chat so I can proceed with the changes.

> app.py
> Add these files to the chat? yes

</div>

This is a convenient and natural workflow for interactive chat,
and it worked well for the SWE Bench problems.
Aider successfully identified the correct file to edit
in 70.3% of the benchmark tasks.

We can determine which file needed to be edited using the "gold" patch
which is associated with each SWE Bench task.
This patch was created by a human developer
to solve the issue, and therefore reveals a file which can
be edited to solve the problem.
Of course aider is not able to see or use the gold patch
or the file names it contains in any way.
This information was only used to compute
statistics outside the benchmarking process.

## Reliable code editing

Once files have been selected for editing,
the next step is of course to edit the source code to fix the problem.

Aider goes to great lengths to ensure that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
been honed through
[extensive benchmarking](https://aider.chat/docs/leaderboards/).
These foundational capabilities help ensure that aider can
properly integrate code from LLMs into an existing code base and source files.
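
For a sense of what an editing backend does, here is a minimal sketch of one common approach: the LLM emits a search/replace style edit and the tool applies it only if the search text matches the file exactly. This is a simplified illustration under that assumption, not aider's actual edit format handling.

```
# Simplified illustration of a search/replace style editing backend.
# Aider's real edit formats and error recovery are more robust than this.
def apply_search_replace(source: str, search: str, replace: str) -> str:
    if search not in source:
        # Reported back as an edit failure, rather than silently guessing.
        raise ValueError("edit failed: SEARCH block not found in file")
    return source.replace(search, replace, 1)  # apply only the first match
```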

The repository map helps here too, making sure that the LLM
can see relevant classes, functions and variables from the entire repo.
This helps ensure that the project's existing APIs and conventions are
respected and utilized when new code is added.

Regardless, there are still cases where aider may be unable to cleanly
complete the edits specified by the LLM.
This is usually because the LLM has failed to conform to the editing
instructions in its system prompt.
When aider completes, it returns an editing outcome that indicates
whether it was able to successfully complete all edits.
The benchmark harness uses this editing status as
one criterion to determine if aider has
created a plausible solution.

## Linting and fixing

Another key criterion for a plausible solution is that it passes basic
linting, which means that the code is valid and without syntax
or other fatal errors.
[Aider lints code](https://aider.chat/2024/05/22/linting.html)
after every LLM edit and offers to automatically fix
any problems.

Aider ships with built-in linters based on tree-sitter
which work with most popular programming languages.
Aider shows linting errors to the LLM in a novel format,
using the abstract syntax tree to display relevant code context for each
error.
This context helps LLMs understand the problem and
make the correct changes to resolve it.

<div class="chat-transcript" markdown="1">

```
app.py:23:36: F821 undefined name 'num'

app.py:
...⋮...
  6│class LongNum:
...⋮...
 19│    def expound(self, threshold):
 20│        number = self.basis
 21│        while number < threshold:
 22│            number *= self.factor
 23█        return num
 24│
 25│
...⋮...
```

> Attempt to fix lint errors? yes

</div>

In the benchmark, these linting suggestions are always accepted.
At completion,
aider reports a linting outcome that
indicates if it was able to produce
code without any outstanding linting errors.
The benchmark harness uses this status as
one of the criteria to determine if aider has
created a plausible solution.
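
As a minimal illustration of what "basic linting" means here, the check below flags fatal Python syntax errors in a file; it is only a stand-in for aider's tree-sitter based linters, which cover many languages and produce the richer context shown above.

```
# Minimal stand-in for a basic lint pass: flag fatal Python syntax errors.
# Aider's real linters are tree-sitter based and language-aware.
def basic_lint(path):
    with open(path) as f:
        source = f.read()
    try:
        compile(source, path, "exec")  # parse/compile only; nothing is executed
    except SyntaxError as err:
        return f"{path}:{err.lineno}:{err.offset}: {err.msg}"
    return None  # no fatal errors found
```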

## Testing and fixing

The final criterion for a plausible solution is that
all tests must be passing.
Aider can be configured with the command to run tests for a repo,
and will automatically attempt to fix any test failures.

A user working on a python project might configure testing
by launching aider like this:

```
aider --test-cmd pytest
```

For the benchmark, aider is configured with a test command that will run the
tests that already exist in each problem's repository.
SWE Bench problems are based on repositories from large open
source projects with extensive existing test suites.
This means that
testing will fail if aider has broken any of these
pre-existing tests or if any new
tests that it created aren't passing.
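
Conceptually, the test step boils down to running the configured command and capturing its output so failures can be shown to the LLM to fix. A rough sketch, not aider's internal code:

```
# Rough sketch of the test step: run the configured test command and, on
# failure, keep the output so it can be fed back to the LLM.
import subprocess

def run_test_cmd(cmd="pytest"):
    proc = subprocess.run(cmd.split(), capture_output=True, text=True)
    passed = proc.returncode == 0
    return passed, proc.stdout + proc.stderr
```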

As with editing and linting, aider reports a testing outcome
that indicates if it completed with any outstanding failing tests.
The benchmark harness uses this status when deciding if aider
has produced a plausible solution.

To be clear, *aider cannot run or even see the held out "acceptance tests"* that
are used to determine if a proposed solution correctly
resolves the problem.
Those tests are only run outside of aider and the benchmark harness,
to compute the final benchmark score.

## Finding a plausible solution

Each time aider executes, it reports
the outcome of the editing, linting, and testing
steps.
Each of these steps may complete successfully or
return a status that indicates that there were outstanding
problems that remain unresolved.

The benchmark harness uses these outcomes to determine if
aider has produced a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint, or test errors.
In this case, aider's changes are recorded
as the SWE Bench `model_patch` to be evaluated later with the
acceptance tests.

If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
The harness alternates launching aider with GPT-4o and Opus to solve the problem,
and gives each model three attempts -- for a total of six attempts.
As soon as a plausible solution is found, it is accepted and the
harness moves on to the next SWE Bench instance.

It's worth noting that repositories may have lint or test errors
present before aider even starts to edit them.
Whether unresolved errors were caused by aider or were pre-existing,
there will be instances where
no plausible solution is
found after six tries.

If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
`model_patch`.
Which of the non-plausible solutions to use is determined
by ignoring the testing outcome
and prioritizing solutions in the following order
(one way to encode this ranking is sketched after the list):

- Pick a solution where editing and linting were completed successfully.
- Pick a solution where editing was at least partially successful and linting succeeded.
- Pick a solution where editing was successful.
- Pick a solution where editing was at least partially successful.
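
Expressed as code, that priority order might look like the ranking below; the attempt fields (`edit_ok`, `edit_partial`, `lint_ok`) are assumed names for illustration, not the harness's actual data model.

```
# Sketch of the fallback ranking described above; lower rank wins.
# Field names are assumptions, and test results are ignored here.
def fallback_rank(attempt):
    if attempt.edit_ok and attempt.lint_ok:
        return 0  # edits applied cleanly and lint is clean
    if attempt.edit_partial and attempt.lint_ok:
        return 1  # some edits applied, lint is clean
    if attempt.edit_ok:
        return 2  # edits applied cleanly, lint problems remain
    if attempt.edit_partial:
        return 3  # some edits applied
    return 4      # nothing usable was produced

def pick_best_non_plausible(attempts):
    return min(attempts, key=fallback_rank)
```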

## Computing the benchmark score

The benchmark harness produces a candidate solution for each of the 300
SWE Bench Lite instances and saves it as the `model_patch`.

A separate evaluation script
tests each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
are discarded.
This ensures that the full, correct test suite is used for acceptance testing.
The evaluation script compares the test results
with results from testing
the "gold" patch that was developed by a human to correctly solve the issue.
If they match, the candidate solution has correctly resolved the issue.
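
In essence, the acceptance check asks whether the tests that pass with the gold patch applied also pass with the candidate patch applied. A simplified sketch of that comparison; the real SWE Bench evaluation tracks specific sets of tests per instance:

```
# Simplified sketch of the acceptance comparison described above.
# Each results dict maps test id -> True (passed) / False (failed).
def correctly_resolved(candidate_results, gold_results):
    must_pass = {test for test, ok in gold_results.items() if ok}
    return all(candidate_results.get(test, False) for test in must_pass)
```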

These acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used, or even visible during aider's attempts to solve the problems.

Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.

## Acknowledgments

Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](SWE-bench-docker),
making it faster, easier, and more reliable to run the acceptance tests.

@@ -618,6 +618,20 @@ class Coder:
             return self.commands.run(inp)

         self.check_for_file_mentions(inp)
+        inp = self.check_for_urls(inp)
+
+        return inp
+
+    def check_for_urls(self, inp):
+        url_pattern = re.compile(
+            r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
+        )
+        urls = url_pattern.findall(inp)
+        for url in urls:
+            if self.io.confirm_ask(f"Add {url} to the chat?"):
+                inp += "\n\n"
+                inp += self.commands.cmd_web(url)
+
         return inp

     def keyboard_interrupt(self):
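
For illustration, the URL pattern added above picks HTTP(S) URLs out of a free-form chat message. This standalone snippet (independent of the `Coder` class, with a made-up example message) shows the behavior:

```
# Standalone illustration of the URL detection added above.
import re

url_pattern = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

inp = "Please fix the bug described in https://example.com/project/issues/123 today"
print(url_pattern.findall(inp))
# -> ['https://example.com/project/issues/123']
```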

@@ -69,8 +69,8 @@ class Commands:
             self.scraper = Scraper(print_error=self.io.tool_error)

         content = self.scraper.scrape(url) or ""
-        if content:
-            self.io.tool_output(content)
+        # if content:
+        #     self.io.tool_output(content)

         instructions = self.scraper.get_playwright_instructions()
         if instructions:

assets/swe_bench_lite.jpg (new binary file, 36 KiB; not shown)

assets/swe_bench_lite.svg (new file, 1750 lines, 43 KiB; diff suppressed because it is too large)

@@ -22,6 +22,7 @@ def plot_over_time(yaml_file):
     plt.rcParams["hatch.color"] = "#444444"

     rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
+    plt.rcParams["text.color"] = "#444444"

     fig, ax = plt.subplots(figsize=(10, 5))
     ax.grid(axis="y", zorder=0, lw=0.2)

@@ -44,10 +45,12 @@ def plot_over_time(yaml_file):
             textcoords="offset points",
         )

-    ax.set_xlabel("Model release date", fontsize=18)
-    ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18)
+    ax.set_xlabel("Model release date", fontsize=18, color="#555")
+    ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18, color="#555")
     ax.set_title("LLM code editing skill by model release date", fontsize=20)
-    plt.tight_layout()
+    ax.set_ylim(0, 30)
+    plt.xticks(fontsize=14)
+    plt.tight_layout(pad=3.0)
     plt.savefig("tmp_over_time.png")
     plt.savefig("tmp_over_time.svg")
     imgcat(fig)

benchmark/swe_bench_lite.py (new file, 71 lines)
@@ -0,0 +1,71 @@
import matplotlib.pyplot as plt
from imgcat import imgcat
from matplotlib import rc


def plot_swe_bench_lite(data_file):
    with open(data_file, "r") as file:
        lines = file.readlines()

    models = []
    pass_rates = []

    # Each data line looks like "<pass rate>% <model label>", where "|" in the
    # label marks a line break for the x-axis tick text.
    for line in lines:
        if line.strip():
            pass_rate, model = line.split("%")
            model = model.strip()
            model = model.replace("|", "\n")
            models.insert(0, model.strip())
            pass_rates.insert(0, float(pass_rate.strip()))

    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"

    font_color = "#555"
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
    plt.rcParams["text.color"] = font_color

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.grid(axis="y", zorder=0, lw=0.2)
    for spine in ax.spines.values():
        spine.set_edgecolor("#DDDDDD")
        spine.set_linewidth(0.5)

    # Highlight the Aider bars; other leaderboard entries get muted colors.
    colors = ["#17965A" if "Aider" in model else "#b3d1e6" for model in models]
    bars = []
    for model, pass_rate, color in zip(models, pass_rates, colors):
        alpha = 0.6 if "Aider" in model else 0.3
        bar = ax.bar(model, pass_rate, color=color, alpha=alpha, zorder=3)
        bars.append(bar[0])

    for model, bar in zip(models, bars):
        yval = bar.get_height()
        y = yval + 0.75 if "Aider" in model else yval - 1.25
        va = "bottom" if "Aider" in model else "top"

        ax.text(
            bar.get_x() + bar.get_width() / 2,
            y,
            f"{yval}%",
            ha="center",
            va=va,
            fontsize=14,
        )

    # ax.set_xlabel("Models", fontsize=18)
    ax.set_ylabel("Instances resolved (%)", fontsize=18, color=font_color)
    ax.set_title("SWE Bench Lite", fontsize=20)
    ax.set_ylim(0, 29.9)
    plt.xticks(
        fontsize=16,
        color=font_color,
    )
    plt.tight_layout(pad=3.0)
    plt.savefig("swe_bench_lite.jpg")
    plt.savefig("swe_bench_lite.svg")
    imgcat(fig)

    ax.xaxis.label.set_color(font_color)


# Example usage
plot_swe_bench_lite("benchmark/tmp.txt")