Commit graph

4262 commits

Author SHA1 Message Date
Paul Gauthier
5032adf44b Add the url content 2024-05-22 21:25:34 -07:00
Paul Gauthier
dc8ae596c3 aider: Implement regex to find all URLs in the input string.
# Aider chat conversation:

USER: do the todo

ASSISTANT: To address the TODO comment, we need to implement a regex to find all URLs in the input string. Here are the changes:

1. Import the `re` module for regular expressions.
2. Implement the `check_for_urls` method to find and process URLs in the input string.

Here are the *SEARCH/REPLACE* blocks:

aider/coders/base_coder.py
```python
<<<<<<< SEARCH
import hashlib
=======
import hashlib
import re
>>>>>>> REPLACE
```

aider/coders/base_coder.py
```python
<<<<<<< SEARCH
        # todo use a regex to find all urls in inp

        return inp
=======
        url_pattern = re.compile(
            r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        )
        urls = url_pattern.findall(inp)
        if urls:
            self.io.tool_output(f"Found URLs: {', '.join(urls)}")
        return inp
>>>>>>> REPLACE
```

This will implement the functionality to find all URLs in the input string using a regular expression.
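
Taken together, the two blocks above amount to something like this self-contained sketch (in aider the method lives on a coder class and reports through `self.io.tool_output`; here a plain function and `print` stand in):

```python
import re

# Same pattern as in the SEARCH/REPLACE block above.
url_pattern = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

def check_for_urls(inp):
    # Report any URLs found in the input; the input passes through unchanged.
    urls = url_pattern.findall(inp)
    if urls:
        print(f"Found URLs: {', '.join(urls)}")
    return inp

check_for_urls("docs live at https://aider.chat/docs/")
```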
2024-05-22 21:19:44 -07:00
Paul Gauthier
aca24976a7 Add method to check for URLs in input. 2024-05-22 21:19:43 -07:00
Paul Gauthier
cdf4969c01 copy 2024-05-22 20:24:10 -07:00
Paul Gauthier
8b5451f4ab copy 2024-05-22 20:15:46 -07:00
Paul Gauthier
5e13399f46 copy 2024-05-22 18:26:04 -07:00
Paul Gauthier
18e3f55c4e copy 2024-05-22 17:18:59 -07:00
Paul Gauthier
d437be5bc1 copy 2024-05-22 16:14:22 -07:00
Paul Gauthier
71ddb2799f aider: Aider achieved a state-of-the-art result on the SWE Bench Lite benchmark, with improvements in code editing, linting, and testing processes.
# Aider chat conversation:

USER: proofread this and make corrections. don't change style, but fix problems.

ASSISTANT: _posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider scored 26.3%
on the
[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
=======
Aider scored 26.3% on the [SWE Bench Lite benchmark](https://www.swebench.com), achieving a state-of-the-art result. The current top leaderboard entry is 20.3% from Amazon Q Developer Agent. The best result reported elsewhere online seems to be
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior":
it doesn't require a highly detailed upfront "spec" from
^C KeyboardInterrupt

USER: don't fix formatting like that. linebreaks are fine. it will render properly.

ASSISTANT: Understood. I will only fix grammatical and spelling errors without changing the formatting.

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider scored 26.3%
on the
[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
=======
Aider scored 26.3%
on the
[SWE Bench Lite benchmark](https://www.swebench.com), achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior":
it doesn't require a highly detailed upfront "spec" from the user,
use RAG or vector search, farm out sub-problems to an army of LLMs,
allow the LLM to use tools
or perform  web searches,
=======
Aider achieved this result mainly through its focus on static code analysis,
reliable LLM code editing,
and pragmatic workflows for interactive pair programming with AI.
Aider intentionally has quite limited and narrow "agentic behavior":
it doesn't require a highly detailed upfront "spec" from the user,
use RAG or vector search, farm out sub-problems to an army of LLMs,
allow the LLM to use tools,
or perform web searches,
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
When a user asks aider for a change, they see the edits performed in real-time
and aider may also then offer additional
help like fixing lint or test errors.
In normal use, the user is in full interactive control.
This lets them quickly steer misunderstandings back on course and
=======
When a user asks aider for a change, they see the edits performed in real-time,
and aider may also then offer additional
help like fixing lint or test errors.
In normal use, the user is in full interactive control.
This lets them quickly steer misunderstandings back on course and
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user".
After that aider runs as normal, with the following modifications:
=======
For the benchmark,
aider was launched in each problem's git repository
with the problem statement
submitted as the opening chat message from "the user."
After that aider runs as normal, with the following modifications:
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
- A simple harness was used to retry the SWE Bench problem if aider produced code which wasn't *plausibly correct*.
=======
- A simple harness was used to retry the SWE Bench problem if aider produced code that wasn't *plausibly correct*.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
- If the solution isn't plausible, the harness launches aider to try again from scratch
=======
- If the solution isn't plausible, the harness launches aider to try again from scratch,
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again. Maybe with a different LLM this time.
=======
- If aider doesn't produce code that lints and tests clean, the user might decide to revert the changes and try again, maybe with a different LLM this time.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Of course, outside a benchmark setting it's probably
=======
Of course, outside a benchmark setting, it's probably
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
achieved a score of 25%.
This was itself a state of the art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.
=======
Running the entire SWE Bench Lite benchmark using aider with just GPT-4o
achieved a score of 25%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o & Opus.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
The table below breaks down the 79 solutions which were ultimately
=======
The table below breaks down the 79 solutions that were ultimately
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
- Aider with GPT-4o immediately found 77% of the valid solutions on the first attempt.
=======
- Aider with GPT-4o immediately found 77% of the valid solutions on the first attempt.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
If we breakdown correct solutions purely by model,
=======
If we break down correct solutions purely by model,
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure and content of a git repo.
The repo map is created from the code's AST and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation.
This is done by performing a graph optimization on the code's call graph.
=======
Aider instead uses a
[repository map](https://aider.chat/2023/10/22/repomap.html)
to help the LLM understand the
layout, code structure, and content of a git repo.
The repo map is created from the code's AST and call graph
to provide a compact and powerful summary of the entire code base.
The map is constantly
tailored to show
repo context that is relevant to the current state of the chat conversation.
This is done by performing a graph optimization on the code's call graph.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
When the user asks for a change to their code, the LLM uses the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
=======
When the user asks for a change to their code, the LLM uses the repo map
to decide which files to edit.
The LLM simply returns a normal text response explaining which files
it needs to edit and why.
Aider notices when the LLM mentions filenames from the repo,
and asks the user if they should be added to the chat.
Adding a file to the chat allows the LLM to see the full contents
of the file and edit it.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider has  always had a deep focus on ensuring that LLMs can not just write code,
but reliably *edit* code.
Aider a collection of prompting strategies and code editing backends which have
=======
Aider has always had a deep focus on ensuring that LLMs can not just write code,
but reliably *edit* code.
Aider has a collection of prompting strategies and code editing backends which have
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
[Aider lints code]()
after every LLM edit, and offers to automatically fix
=======
[Aider lints code](https://aider.chat/docs/linting/)
after every LLM edit and offers to automatically fix
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider includes basic linters built with tree-sitter that support
=======
Aider includes basic linters built with tree-sitter that support
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
This context increases the ability of the LLM to understand the problem and
=======
This context increases the ability of the LLM to understand the problem and
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
A user working on a python project might do that by launching
=======
A user working on a Python project might do that by launching
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
The repositories that are used in the SWE Bench problems are large open
=======
The repositories that are used in the SWE Bench problems are large, open
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
A repo's test suite can be run in three ways:
=======
A repo's test suite can be run in three ways:
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
2. Run tests after aider has modified the repo.
=======
2. Run tests after aider has modified the repo.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
3. Run the final "acceptance tests" to judge if the coding agent has
=======
3. Run the final "acceptance tests" to judge if the coding agent has
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
For the benchmark, aider is configured with a test command that will run the tests
=======
For the benchmark, aider is configured with a test command that will run the tests
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
When aider runs a test command, it checks for a non-zero exit status.
=======
When aider runs a test command, it checks for a non-zero exit status.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
To be clear, *aider can not run or even see the "acceptance tests"* from the `test_patch`
=======
To be clear, *aider cannot run or even see the "acceptance tests"* from the `test_patch`
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
As aider executes, it notes the outcome of the editing, linting and testing
steps.
When aider completes, it returns their final status as either:
succeeded with no errors remaining,
or ended without resolving all errors.
=======
As aider executes, it notes the outcome of the editing, linting, and testing
steps.
When aider completes, it returns its final status as either:
succeeded with no errors remaining,
or ended without resolving all errors.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
The benchmark harness uses these outcomes to determine if it has a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint or test errors.
=======
The benchmark harness uses these outcomes to determine if it has a plausible
solution to the current SWE Bench task.
A plausible solution is one where aider
returns saying that it
edited the repo with no outstanding
edit, lint, or test errors.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
=======
If the solution is not plausible, another
instance of aider is launched again from scratch on the same problem.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
It's worth noting that repositories may have lint or test errors present before aider even starts to edit them. Whether errors are caused by aider or were pre-existing, there will be instances where, after six tries, no plausible solution is obtained.
=======
It's worth noting that repositories may have lint or test errors present before aider even starts to edit them. Whether errors are caused by aider or were pre-existing, there will be instances where, after six tries, no plausible solution is obtained.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as a the
=======
If all six attempts fail to produce a plausible solution,
then the "best" solution available is selected as the
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Which of the non-plausible solutions to use is determined
=======
Which of the non-plausible solutions to use is determined
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
The benchmark harness produces one "best" solution for each of the 300
SWE Bench Lite instances, and saves it as a `model_patch`.
A separate evaluation script uses the SWE Bench support code to
test each of these results with the acceptance tests.
=======
The benchmark harness produces one "best" solution for each of the 300
SWE Bench Lite instances and saves it as a `model_patch`.
A separate evaluation script uses the SWE Bench support code to
test each of these results with the acceptance tests.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used or even visible during the attempts to solve the problems.
=======
These `test_patch` acceptance tests are only ever run outside of aider
and the benchmark harness, and only to compute the number of
correctly resolved instances.
They are never run, used, or even visible during the attempts to solve the problems.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
=======
Aider correctly resolved 79 out of 300 SWE Bench Lite instances, or 26.3%.
>>>>>>> REPLACE
</source>

_posts/2024-05-22-swe-bench-lite.md
<source>markdown
<<<<<<< SEARCH
Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](SWE-bench-docker)
making it faster, easier and more reliable to run the acceptance tests.
=======
Much thanks to the team behind the
[SWE Bench](https://www.swebench.com)
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
[dockerized the SWE Bench evaluation scripts](SWE-bench-docker)
making it faster, easier, and more reliable to run the acceptance tests.
>>>>>>> REPLACE
</source>
2024-05-22 16:11:17 -07:00
Paul Gauthier
5b8bbdbe5f copy 2024-05-22 16:08:51 -07:00
Paul Gauthier
b67891e7f8 copy 2024-05-22 15:56:11 -07:00
Paul Gauthier
9f2554fed7 copy 2024-05-22 15:33:11 -07:00
Paul Gauthier
fcfc05e2d3 aider: Make the x and y axis labels also #555.
# Aider chat conversation:

USER: make the x and y axis labels also #555

ASSISTANT: Here are the changes to make the x and y axis labels also #555 in `benchmark/swe_bench_lite.py` and `benchmark/over_time.py`:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    ax.set_ylabel("Pass rate (%)", fontsize=18)
    ax.set_title("SWE Bench Lite", fontsize=20)
    ax.set_ylim(0, 30)
    plt.xticks(rotation=45, ha="right", fontsize=16)
    plt.tight_layout(pad=3.0)
    plt.savefig("swe_bench_lite.jpg")
    plt.savefig("swe_bench_lite.svg")
    imgcat(fig)
=======
    ax.set_ylabel("Pass rate (%)", fontsize=18, color="#555")
    ax.set_title("SWE Bench Lite", fontsize=20)
    ax.set_ylim(0, 30)
    plt.xticks(rotation=45, ha="right", fontsize=16)
    plt.tight_layout(pad=3.0)
    plt.savefig("swe_bench_lite.jpg")
    plt.savefig("swe_bench_lite.svg")
    imgcat(fig)
    ax.xaxis.label.set_color("#555")
>>>>>>> REPLACE
```

benchmark/over_time.py
```python
<<<<<<< SEARCH
    ax.set_xlabel("Model release date", fontsize=18)
    ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18)
    ax.set_title("LLM code editing skill by model release date", fontsize=20)
    ax.set_ylim(0, 30)
    plt.xticks(fontsize=14)
    plt.tight_layout(pad=3.0)
    plt.savefig("tmp_over_time.png")
    plt.savefig("tmp_over_time.svg")
    imgcat(fig)
=======
    ax.set_xlabel("Model release date", fontsize=18, color="#555")
    ax.set_ylabel("Aider code editing benchmark,\npercent completed correctly", fontsize=18, color="#555")
    ax.set_title("LLM code editing skill by model release date", fontsize=20)
    ax.set_ylim(0, 30)
    plt.xticks(fontsize=14)
    plt.tight_layout(pad=3.0)
    plt.savefig("tmp_over_time.png")
    plt.savefig("tmp_over_time.svg")
    imgcat(fig)
>>>>>>> REPLACE
```
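
An equivalent, arguably tidier route (not what the commit does) is to set the label color once via rcParams before plotting; a minimal sketch:

```python
import matplotlib.pyplot as plt

# Colors every axes label up front, so no per-label color= arguments
# or set_color() calls are needed afterwards.
plt.rcParams["axes.labelcolor"] = "#555"

fig, ax = plt.subplots()
ax.set_xlabel("Model release date", fontsize=18)
ax.set_ylabel("Pass rate (%)", fontsize=18)
fig.savefig("tmp_labels.png")
```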
2024-05-22 15:29:58 -07:00
Paul Gauthier
c3b8caf480 Update text color to "#555" in swe_bench_lite.py. 2024-05-22 15:29:57 -07:00
Paul Gauthier
1f7ff54717 aider: Set font color using plt.rcParams to resolve KeyError for unrecognized key "font.color".
# Aider chat conversation:

USER: KeyError: 'Unrecognized key "font.color" for group "font" and name "color"'

ASSISTANT: To fix the `KeyError` related to the unrecognized key "font.color", we need to set the font color using `plt.rcParams` instead of `rc`.

Here are the changes:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10, "color": "#444444"})
=======
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
    plt.rcParams["text.color"] = "#444444"
>>>>>>> REPLACE
```

benchmark/over_time.py
```python
<<<<<<< SEARCH
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10, "color": "#444444"})
=======
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
    plt.rcParams["text.color"] = "#444444"
>>>>>>> REPLACE
```
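
A standalone reproduction of the error and the fix (assuming a stock matplotlib install):

```python
import matplotlib.pyplot as plt
from matplotlib import rc

# The "font" rc group has no "color" key, so this raises
# KeyError: 'Unrecognized key "font.color" for group "font" and name "color"'
try:
    rc("font", **{"family": "sans-serif", "size": 10, "color": "#444444"})
except KeyError as err:
    print(err)

# The default text color lives in the "text" group instead:
plt.rcParams["text.color"] = "#444444"
```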
2024-05-22 15:28:44 -07:00
Paul Gauthier
81ac2e18cc aider: Changed font color to dark grey in benchmark/swe_bench_lite.py and benchmark/over_time.py.
# Aider chat conversation:

USER: make the font color dark grey not black

ASSISTANT: Here are the changes to make the font color dark grey instead of black in `benchmark/swe_bench_lite.py` and `benchmark/over_time.py`:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
=======
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10, "color": "#444444"})
>>>>>>> REPLACE
```

benchmark/over_time.py
```python
<<<<<<< SEARCH
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})
=======
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10, "color": "#444444"})
>>>>>>> REPLACE
```
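
Note: passing a `color` key to `rc("font", ...)` is exactly what triggers the `KeyError: 'Unrecognized key "font.color" ...'` fixed in commit 1f7ff54717 above; matplotlib's "font" rc group has no color entry, so the color belongs in `text.color` instead.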
2024-05-22 15:28:30 -07:00
Paul Gauthier
5ba549834c Change the figure size in swe_bench_lite plot from 10x6 to 10x5. 2024-05-22 15:28:28 -07:00
Paul Gauthier
0b01b7caf5 copy 2024-05-22 15:26:16 -07:00
Paul Gauthier
19e7823859 aider: Add more whitespace around the perimeter.
# Aider chat conversation:

USER: add more whitespace around the perimeter

ASSISTANT: Here are the changes to add more whitespace around the perimeter in `benchmark/swe_bench_lite.py` and `benchmark/over_time.py`:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    plt.tight_layout()
=======
    plt.tight_layout(pad=3.0)
>>>>>>> REPLACE
```

benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.tight_layout()
=======
    plt.tight_layout(pad=3.0)
>>>>>>> REPLACE
```
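
For reference, `tight_layout`'s `pad` argument is measured as a fraction of the font size (the matplotlib default is 1.08), so `pad=3.0` nearly triples the default margin; a standalone sketch:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
# Padding between the figure edge and the axes, in multiples of the font size.
plt.tight_layout(pad=3.0)
fig.savefig("tmp_pad.png")
```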
2024-05-22 15:18:45 -07:00
Paul Gauthier
aac19f5911 Update xticks fontsize to 16 in swe_bench_lite.py. 2024-05-22 15:18:44 -07:00
Paul Gauthier
a6ae4cab06 aider: Make the model labels on the x axis larger.
# Aider chat conversation:

USER: make the model labels on the x axis larger

ASSISTANT: Here are the changes to make the model labels on the x-axis larger in `benchmark/swe_bench_lite.py` and `benchmark/over_time.py`:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    plt.xticks(rotation=45, ha='right')
=======
    plt.xticks(rotation=45, ha='right', fontsize=14)
>>>>>>> REPLACE
```

benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.tight_layout()
=======
    plt.xticks(fontsize=14)
    plt.tight_layout()
>>>>>>> REPLACE
```
2024-05-22 15:18:20 -07:00
Paul Gauthier
6765b12a7d Update plot_swe_bench_lite function to handle model names with '|' by replacing it with a newline character and adjusting bar colors and transparency. 2024-05-22 15:18:18 -07:00
Paul Gauthier
5ef96a32e2 aider: Set the y-axis to go to 30.
# Aider chat conversation:

USER: make the y-axis go to 30

ASSISTANT: Here are the changes to make the y-axis go to 30 in `benchmark/swe_bench_lite.py`:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    plt.xticks(rotation=45, ha='right')
=======
    ax.set_ylim(0, 30)
    plt.xticks(rotation=45, ha='right')
>>>>>>> REPLACE
```

Here are the changes to make the y-axis go to 30 in `benchmark/over_time.py`:

benchmark/over_time.py
```python
<<<<<<< SEARCH
    plt.tight_layout()
=======
    ax.set_ylim(0, 30)
    plt.tight_layout()
>>>>>>> REPLACE
```
2024-05-22 15:14:38 -07:00
Paul Gauthier
d1bbcb9bf8 Change pass rate label to lowercase and update colors in SWE Bench Lite plot. 2024-05-22 15:14:37 -07:00
Paul Gauthier
5761218b9a aider: Add imgcat to display the graph after saving it.
# Aider chat conversation:

USER: save it as a png and svg and imgcat it out

ASSISTANT: To save the graph as a PNG and SVG and then display it using `imgcat`, we need to modify `benchmark/swe_bench_lite.py`:

1. Import `imgcat`.
2. Add `imgcat` to display the graph after saving it.

Here are the *SEARCH/REPLACE* blocks:

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
from matplotlib import rc
=======
from matplotlib import rc
from imgcat import imgcat
>>>>>>> REPLACE
```

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
    plt.show()
=======
    imgcat(fig)
    plt.show()
>>>>>>> REPLACE
```
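
Worth noting: the `imgcat` package renders images inline in terminals that support the iTerm2 inline-image protocol, and it accepts a matplotlib figure directly, which is why `imgcat(fig)` needs no intermediate image file here.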
2024-05-22 15:13:01 -07:00
Paul Gauthier
c9349d6e23 Updated SWE Bench Lite to display pass rates by model with adjusted colors and title. 2024-05-22 15:13:00 -07:00
Paul Gauthier
61019093a2 aider: Added code to plot a vertical bar graph of data from tmp.txt in swe_bench_lite.py.
# Aider chat conversation:

USER: plot a vertical bar graph of the data in tmp.txt. use similar fonts, colors and graph style to the one in over_time.py. put the new graph code in swe_bench_lite.py

ASSISTANT: Here is the code to plot a vertical bar graph of the data in `tmp.txt` using similar fonts, colors, and graph style to the one in `over_time.py`. This code will be placed in `swe_bench_lite.py`.

benchmark/swe_bench_lite.py
```python
<<<<<<< SEARCH
=======
import matplotlib.pyplot as plt
from matplotlib import rc

def plot_swe_bench_lite(data_file):
    with open(data_file, "r") as file:
        lines = file.readlines()

    models = []
    pass_rates = []

    for line in lines:
        if line.strip():
            pass_rate, model = line.split("%")
            models.append(model.strip())
            pass_rates.append(float(pass_rate.strip()))

    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"

    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.grid(axis="y", zorder=0, lw=0.2)
    for spine in ax.spines.values():
        spine.set_edgecolor("#DDDDDD")
        spine.set_linewidth(0.5)

    colors = [
        "red" if "Aider" in model else "green" if "AutoCodeRover" in model else "blue" for model in models
    ]
    bars = ax.bar(models, pass_rates, color=colors, alpha=0.5, zorder=3)

    for bar in bars:
        yval = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, yval + 0.5, f'{yval}%', ha='center', va='bottom', fontsize=12, alpha=0.75)

    ax.set_xlabel("Models", fontsize=18)
    ax.set_ylabel("Pass Rate (%)", fontsize=18)
    ax.set_title("SWE Benchmark Pass Rates by Model", fontsize=20)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig("swe_bench_lite.png")
    plt.savefig("swe_bench_lite.svg")
    plt.show()

# Example usage
plot_swe_bench_lite("benchmark/tmp.txt")
>>>>>>> REPLACE
```
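
The parser assumes each non-blank line of `benchmark/tmp.txt` pairs a pass rate with a model name, split on the `%` sign. A hypothetical file in that format (numbers taken from the blog post drafted in the commits above):

```
26.3% Aider + GPT-4o & Opus
22.3% AutoCodeRover
20.3% Amazon Q Developer Agent
```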
2024-05-22 15:11:22 -07:00
Paul Gauthier
db30e5da2b Added draft post 2024-05-22 15:03:11 -07:00
Paul Gauthier
d94da4f809 draft posts 2024-05-22 14:34:12 -07:00
Paul Gauthier
81ec0515bb copy 2024-05-22 12:24:34 -07:00
Paul Gauthier
2374a2166f set version to 0.36.1-dev 2024-05-22 12:16:33 -07:00
Paul Gauthier
8868caf2b1 version bump to 0.36.0 2024-05-22 12:15:53 -07:00
Paul Gauthier
13ada38ecc copy 2024-05-22 12:15:40 -07:00
Paul Gauthier
589f1d510e copy 2024-05-22 12:12:30 -07:00
Paul Gauthier
c74c297aa2 copy 2024-05-22 12:09:35 -07:00
Paul Gauthier
29e92b2398 Updated HISTORY, added linting image 2024-05-22 12:08:50 -07:00
Paul Gauthier
210679f020 Use the test_cmd by default 2024-05-22 12:05:29 -07:00
Paul Gauthier
61ac9b7991 HISTORY 2024-05-22 11:55:14 -07:00
Paul Gauthier
47810310c9 Added linting article 2024-05-22 11:55:10 -07:00
Paul Gauthier
5032ce1a95 bump deps for +grep-ast==0.3.2 2024-05-20 09:19:02 -07:00
Paul Gauthier
57a2c61386 updated lint cmd descriptions 2024-05-20 09:01:22 -07:00
Paul Gauthier
ee516e3958 Added import statement for regular expressions in linter module. 2024-05-20 08:56:42 -07:00
Paul Gauthier
ce5dabd5f1 Removed unused import statement for 're' module. 2024-05-20 08:56:30 -07:00
Paul Gauthier
a115e594c0 Add import statement for 're' module and fix regex pattern in linter.py. 2024-05-20 08:56:21 -07:00
Paul Gauthier
9c28bbc98e Refactored linting process and error handling in commands. 2024-05-20 08:55:49 -07:00
Paul Gauthier
deb13c060c copy 2024-05-19 15:14:22 -07:00
Paul Gauthier
04539feccb bump deps to get grep-ast 0.3.1 2024-05-19 15:10:57 -07:00
Paul Gauthier
5745657991 ask to fix lint/test 2024-05-19 15:02:16 -07:00
Paul Gauthier
038806f694 max_reflex=3, auto fix lint/test 2024-05-19 14:24:06 -07:00
Paul Gauthier
642be2a8f0 S/R uniq prompt 2024-05-19 14:22:29 -07:00