From 2a881faac8d578a69edd2759354ceddcabbc24d2 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Thu, 23 May 2024 07:51:06 -0700
Subject: [PATCH] copy

---
 _posts/2024-05-22-swe-bench-lite.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index b9ff99d12..d51a33ea5 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -102,12 +102,12 @@ verified as correctly resolving their issue.
 Some noteworthy observations:
 
 - Just the first attempt of Aider with GPT-4o resolved 20.3% of the problems,
 which ties the Amazon Q Developer Agent currently atop the official leaderboard.
-- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resolved problems.
-- ~75% of all plausible and ~90% of all resolved solutions were found after one attempt from aider with GPT-4o and Opus.
+- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
+These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
 - A long tail of solutions continued to be found by both models including one correctly resolved solution on the final, sixth attempt of that problem.
 
-| Attempt | Agent |Number<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number<br>correctly<br>resolved | Percent of<br>correctly<br>resolved | Percent of<br>SWE Bench Lite Resolved |
+| Attempt | Agent |Number<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number<br>correctly<br>resolved | Percent of<br>correctly<br>resolved | Score on<br>SWE Bench<br>Lite |
 |:--------:|------------|---------:|---------:|----:|---:|--:|
 | 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
 | 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |