2025-06-11 23:25:01 +00:00 · 2024-05-24 07:07:14 -07:00 · 2024-05-24 07:07:14 -07:00 · 37c640bf69
commit 37c640bf69
parent c591ecd331
2 changed files with 38 additions and 35 deletions
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -7,7 +7,7 @@ draft: true

 # Aider scores 26.3% on SWE Bench Lite
 
-Aider scored 26.3%
+[Aider scored 26.3%](https://github.com/swe-bench/experiments/pull/7)
 on the
 [SWE Bench Lite benchmark](https://www.swebench.com),
 achieving a state-of-the-art result. 
@@ -195,7 +195,7 @@ and it worked well for the SWE Bench problems.
 Aider successfully identified the correct file to edit
 in 70.3% of the benchmark tasks.

-We can determine which file needed to be edited using the "gold" patch
+We can determine which file needs to be edited using the "gold" patch
 which is associated with each SWE Bench task.
 This patch was created by a human developer
 to solve the issue, and therefore reveals a file which can
@@ -237,7 +237,7 @@ created a plausible solution.
 ## Linting and fixing

 Another key criteria for a plausible solution is that it passes basic
-linting, which means that the code is valid and without syntax
+linting, which means that the code has no syntax
 or other fatal errors.
 [Aider lints code](https://aider.chat/2024/05/22/linting.html)
 after every LLM edit and offers to automatically fix
@@ -365,15 +365,16 @@ and prioritizing solutions in the following order:

 ## Computing the benchmark score

-The benchmark harness produces a candidate solution for each of the 300
-SWE Bench Lite instances and saves it as the `model_patch`.
+The benchmark harness produced a plausible solution for each of the 300
+SWE Bench Lite instances and saved it as the `model_patch`.

-A separate evaluation script 
-tests each of these solutions with the full test suite
+A separate evaluation script was used to
+test each of these solutions with the full test suite,
 including the held out acceptance tests.
 For this final acceptance testing, any edits that aider made to tests
 are discarded.
-This ensures that the full, correct test suite is used for acceptance testing.
+This ensures that the correct,
+unmodified test suite is used for acceptance testing.
 The evaluation script compares the test results
 with results from testing
 the "gold" patch that was developed by a human to correctly solve the issue.
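
Note on the last hunk: the acceptance step it describes (discard aider's edits to tests, then compare test results against those from the "gold" patch) can be illustrated with a rough sketch. This is not the actual evaluation script. The helper names `is_test_path`, `discard_test_edits`, and `matches_gold`, the path heuristic, and the dict-based representation of edits and test results are all assumptions made for illustration.

```python
from typing import Dict


def is_test_path(path: str) -> bool:
    """Hypothetical heuristic: treat files under a tests/ directory,
    or named test_*.py / *_test.py, as test files."""
    parts = path.split("/")
    name = parts[-1]
    return (
        "tests" in parts[:-1]
        or name.startswith("test_")
        or name.endswith("_test.py")
    )


def discard_test_edits(edits: Dict[str, str]) -> Dict[str, str]:
    """Drop any edits to test files, so acceptance testing runs
    against the unmodified test suite."""
    return {path: body for path, body in edits.items() if not is_test_path(path)}


def matches_gold(candidate: Dict[str, bool], gold: Dict[str, bool]) -> bool:
    """Compare per-test pass/fail outcomes for the candidate solution
    with the outcomes obtained from the gold patch."""
    return candidate == gold


# Example with made-up paths and test names.
edits = {"pkg/core.py": "new source", "tests/test_core.py": "edited test"}
kept = discard_test_edits(edits)

gold_results = {"test_a": True, "test_b": True}
candidate_results = {"test_a": True, "test_b": False}
same = matches_gold(candidate_results, gold_results)
```

In this sketch only `pkg/core.py` survives the filtering, and the candidate does not match the gold results because `test_b` fails.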