diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index 64bf71ea3..b9ff99d12 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -13,7 +13,7 @@ on the achieving a state-of-the-art result.
 The current top leaderboard entry is 20.3%
 from Amazon Q Developer Agent.
-The best result reported elsewhere online seems to be
+The best result reported elsewhere seems to be
 [22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
 
 [![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
@@ -94,26 +94,29 @@ that used aider with both GPT-4o & Opus.
 The benchmark harness alternated between running aider with GPT-4o and Opus.
 The harness proceeded in a fixed order, always starting with GPT-4o and
-then alternating with Opus until a plausible solution was found.
+then alternating with Opus until a plausible solution was found for each
+problem.
 
 The table below breaks down the 79 solutions that were ultimately
 verified as correctly resolving their issue.
 Some noteworthy observations:
 
-- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resulted problems.
+- Just the first attempt of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
+- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resolved problems.
 - ~75% of all plausible and ~90% of all resolved solutions were found after one attempt from aider with GPT-4o and Opus.
-- A long tail of solutions continued to be found by both models including one resolved solution on the final, sixth attempt of that problem.
+- A long tail of solutions continued to be found by both models including one correctly resolved solution on the final, sixth attempt of that problem.
-| Attempt | Agent |Number<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number<br>correctly<br>resolved | Percent<br>of correctly<br>resolved |
-|:--------:|------------|---------:|---------:|----:|---:|
-| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% |
-| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% |
-| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% |
-| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% |
-| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% |
-| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% |
-| **Total** | | **300** | **100%** | **79** | **100%** |
+| Attempt | Agent |Number<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number<br>correctly<br>resolved | Percent of<br>correctly<br>resolved | Percent of<br>SWE Bench Lite Resolved |
+|:--------:|------------|---------:|---------:|----:|---:|--:|
+| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
+| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
+| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
+| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
+| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
+| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
+| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
+
 
 If we break down correct solutions purely by model,
 we can see that aider with GPT-4o outperforms Opus.
diff --git a/assets/swe_bench_lite.jpg b/assets/swe_bench_lite.jpg
index c604a2ad3..c940d2604 100644
Binary files a/assets/swe_bench_lite.jpg and b/assets/swe_bench_lite.jpg differ
diff --git a/assets/swe_bench_lite.svg b/assets/swe_bench_lite.svg
index b115d4164..fe7cecc1a 100644
--- a/assets/swe_bench_lite.svg
+++ b/assets/swe_bench_lite.svg
[regenerated plot: metadata timestamp updated from 2024-05-22T20:23:36 to 2024-05-23T07:38:15; the remaining SVG path/style hunks did not survive extraction and are omitted]
diff --git a/benchmark/swe_bench_lite.py b/benchmark/swe_bench_lite.py
index c67f2c47a..26aca75f6 100644
--- a/benchmark/swe_bench_lite.py
+++ b/benchmark/swe_bench_lite.py
@@ -47,7 +47,7 @@ def plot_swe_bench_lite(data_file):
     )
 
     # ax.set_xlabel("Models", fontsize=18)
-    ax.set_ylabel("Pass rate (%)", fontsize=18, color=font_color)
+    ax.set_ylabel("Instances resolved (%)", fontsize=18, color=font_color)
     ax.set_title("SWE Bench Lite", fontsize=20)
     ax.set_ylim(0, 29)
     plt.xticks(
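For readers checking the arithmetic in the patched table, the new "Percent of SWE Bench Lite Resolved" column is each attempt's resolved count divided by the 300 problems in SWE Bench Lite. A minimal sketch that recomputes all three percentage columns from the raw counts in the table (the counts are taken from the diff above; this is an illustration, not part of aider's benchmark code):

```python
# Per-attempt (agent, plausible, resolved) counts from the table in the post.
attempts = [
    ("Aider with GPT-4o", 208, 61),
    ("Aider with Opus", 49, 10),
    ("Aider with GPT-4o", 20, 3),
    ("Aider with Opus", 9, 2),
    ("Aider with GPT-4o", 11, 2),
    ("Aider with Opus", 3, 1),
]

TOTAL_PROBLEMS = 300  # SWE Bench Lite contains 300 instances

total_plausible = sum(p for _, p, _ in attempts)  # 300 (coincidentally)
total_resolved = sum(r for _, _, r in attempts)   # 79

for i, (agent, plausible, resolved) in enumerate(attempts, start=1):
    pct_plausible = 100 * plausible / total_plausible   # "Percent of plausible"
    pct_resolved = 100 * resolved / total_resolved      # "Percent of correctly resolved"
    pct_benchmark = 100 * resolved / TOTAL_PROBLEMS     # "Percent of SWE Bench Lite Resolved"
    print(f"{i} {agent}: {pct_plausible:.1f}% {pct_resolved:.1f}% {pct_benchmark:.1f}%")

print(f"Total: {total_resolved} resolved = "
      f"{100 * total_resolved / TOTAL_PROBLEMS:.1f}% of SWE Bench Lite")
```

Running this reproduces the table's columns, including the first attempt's 20.3% and the 26.3% overall figure.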