diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index 64bf71ea3..b9ff99d12 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -13,7 +13,7 @@ on the
achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
-The best result reported elsewhere online seems to be
+The best result reported elsewhere seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
![SWE Bench Lite results](https://aider.chat/assets/swe_bench_lite.svg)
@@ -94,26 +94,29 @@ that used aider with both GPT-4o & Opus.
The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
-then alternating with Opus until a plausible solution was found.
+then alternating with Opus until a plausible solution was found for each
+problem.
The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:
-- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resulted problems.
+- The first attempt of aider with GPT-4o alone resolved 20.3% of the problems, tying the Amazon Q Developer Agent entry currently atop the official leaderboard.
+- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resolved problems.
- ~75% of all plausible and ~90% of all resolved solutions were found after one attempt from aider with GPT-4o and Opus.
-- A long tail of solutions continued to be found by both models including one resolved solution on the final, sixth attempt of that problem.
+- A long tail of solutions continued to be found by both models including one correctly resolved solution on the final, sixth attempt of that problem.
-| Attempt | Agent | Number plausible solutions | Percent of plausible solutions | Number correctly resolved | Percent of correctly resolved |
-|:--------:|------------|---------:|---------:|----:|---:|
-| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% |
-| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% |
-| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% |
-| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% |
-| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% |
-| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% |
-| **Total** | | **300** | **100%** | **79** | **100%** |
+| Attempt | Agent | Number plausible solutions | Percent of plausible solutions | Number correctly resolved | Percent of correctly resolved | Percent of SWE Bench Lite resolved |
+|:--------:|------------|---------:|---------:|----:|---:|--:|
+| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
+| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
+| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
+| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
+| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
+| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
+| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
+
If we break down correct solutions purely by model,
we can see that aider with GPT-4o outperforms Opus.
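
The alternating-attempt harness described in the post can be sketched as follows. This is a minimal illustration, not the actual benchmark code: `run_aider` and `is_plausible` are hypothetical stand-ins for the real harness, which runs aider against each SWE Bench Lite instance and checks whether the proposed patch is plausible.

```python
# Sketch of the fixed-order, alternating-model retry loop:
# GPT-4o on odd-numbered attempts, Opus on even-numbered attempts,
# stopping at the first plausible solution (up to 6 attempts).

MODELS = ["gpt-4o", "claude-3-opus"]
MAX_ATTEMPTS = 6


def solve(problem, run_aider, is_plausible):
    """Return (attempt_number, model, candidate) for the first plausible
    solution, or None if no attempt produced one.

    `run_aider(model, problem)` and `is_plausible(candidate, problem)`
    are hypothetical callables standing in for the real harness.
    """
    for attempt in range(MAX_ATTEMPTS):
        model = MODELS[attempt % 2]  # even index -> GPT-4o, odd -> Opus
        candidate = run_aider(model, problem)
        if is_plausible(candidate, problem):
            return attempt + 1, model, candidate
    return None
```

Because the order is fixed, GPT-4o always gets attempts 1, 3, and 5 and Opus gets attempts 2, 4, and 6, matching the per-attempt agent column in the table above.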
diff --git a/assets/swe_bench_lite.jpg b/assets/swe_bench_lite.jpg
index c604a2ad3..c940d2604 100644
Binary files a/assets/swe_bench_lite.jpg and b/assets/swe_bench_lite.jpg differ
diff --git a/assets/swe_bench_lite.svg b/assets/swe_bench_lite.svg
index b115d4164..fe7cecc1a 100644
--- a/assets/swe_bench_lite.svg
+++ b/assets/swe_bench_lite.svg
@@ -6,7 +6,7 @@
- 2024-05-22T20:23:36.416838
+ 2024-05-23T07:38:15.931243
image/svg+xml
diff --git a/benchmark/swe_bench_lite.py b/benchmark/swe_bench_lite.py
index c67f2c47a..26aca75f6 100644
--- a/benchmark/swe_bench_lite.py
+++ b/benchmark/swe_bench_lite.py
@@ -47,7 +47,7 @@ def plot_swe_bench_lite(data_file):
)
# ax.set_xlabel("Models", fontsize=18)
- ax.set_ylabel("Pass rate (%)", fontsize=18, color=font_color)
+ ax.set_ylabel("Instances resolved (%)", fontsize=18, color=font_color)
ax.set_title("SWE Bench Lite", fontsize=20)
ax.set_ylim(0, 29)
plt.xticks(