2025-06-01 18:25:00 +00:00 · 2024-05-22 15:26:16 -07:00 · 2024-05-22 15:26:16 -07:00 · 0b01b7caf5
commit 0b01b7caf5
parent 19e7823859
4 changed files with 1660 additions and 14 deletions
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -1,12 +1,15 @@
 ---
 title: Aider scores SOTA 26.3% on SWE Bench Lite
 excerpt: Aider scored 26.3% on SWE Bench Lite, achieving a state of the art result.
+highlight_image: /assets/swe_bench_lite.jpg
 draft: true
 ---

+[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
+
 # Aider scores SOTA 26.3% on SWE Bench Lite
 
-[Aider scored 26.3%]()
+Aider scored 26.3%
 on the
 [SWE Bench Lite benchmark](https://www.swebench.com), achieving a state of the art result. 
 The current top leaderboard entry is 20.33%
@@ -14,6 +17,8 @@ from Amazon Q Developer Agent.
 The best result reported elsewhere online seems to be
 [22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).

+## Interactive, not agentic
+
 Aider achieved this result mainly through its focus on static code analysis,
 reliable LLM code editing
 and pragmatic workflows for interactive pair programming with AI.
@@ -33,6 +38,8 @@ When a user asks aider for a change, they see the edits performed in real-time.
 Aider may also then offer additional
 help like fixing lint or test errors.

+## Methodology
+
 For the benchmark, 
 aider was launched in each problem's git repository
 with the problem statement
@@ -113,7 +120,7 @@ Some noteworthy observations:
 | 6 | Opus   |  1 |  1.3 | 100.0
 |**Total**|   | **79** | **100%** | **100%** |

-If we just look at which models produced correct solutions,
+If we break down correct solutions purely by model,
 we can see that GPT-4o dominates.
 This isn't a fair comparison, because GPT-4o always took the first
 attempt at solving.
@@ -145,8 +152,7 @@ to provide a compact and powerful summary of the entire code base.
 The map is constantly
 tailored to show
 repo context that is relevant to the current state of the chat conversation.
-
-by performing a graph optimization on the code's call graph.
+This is done by performing a graph optimization on the code's call graph.

 When the user asks for a change to their code, the LLM uses the repo map
 to decide which files to edit.