From faf15188c5d18f2721d8236c0fe7d22567b26e50 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Sat, 25 May 2024 18:58:42 -0700
Subject: [PATCH] copy

---
 _posts/2024-05-22-swe-bench-lite.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index 5c37cf04f..a584f9e42 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -109,7 +109,7 @@ verified as correctly resolving their issue.
 
 Some noteworthy observations:
 
 - *Just the first attempt* of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
-- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark, better than all other known results.
+- Including the second attempt, Aider with GPT-4o and Opus scored 23.6% on the benchmark.
 These first two attempts obtained ~75% of all plausible and ~90% of all resolved solutions.
 - A long tail of solutions continued to be found using both models including one correctly resolved solution on the final, sixth attempt of that problem.