Paul Gauthier 2024-06-02 06:28:56 -07:00
parent e5c831d1b6
commit ad320e085d


@@ -16,16 +16,16 @@ from Amazon Q Developer Agent.
 The best result reported elsewhere seems to be
 [13.9% from Devin](https://www.cognition.ai/post/swe-bench-technical-report).
-This result on the main SWE Bench is in addition to
+This result on the main SWE Bench builds on
 [aider's recent SOTA result on the easier SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 
 [![SWE Bench results](/assets/swe_bench.svg)](https://aider.chat/assets/swe_bench.svg)
 
 Aider was benchmarked on the same
-[randomly selected 570 problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
-from SWE Bench that were used in the
+[570 randomly selected SWE Bench problems](https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs)
+that were used in the
 [Devin evaluation](https://www.cognition.ai/post/swe-bench-technical-report).
-Please see the [references](#references)
+See the [references](#references)
 for more details on the data presented in this chart.
 
 ## Interactive, not agentic
@@ -76,21 +76,13 @@ This is the same approach
 that was used for
 [aider's recent SOTA result on SWE Bench Lite](https://aider.chat/2024/05/22/swe-bench-lite.html).
 For the Lite benchmark,
-aider alternated between GPT-4o and Opus for up to 6 total attempts.
-Due to the increased token costs involved in running
-the main SWE Bench benchmark, aider was limited to 2 total attempts:
-one attempt of aider with GPT-4o and one with Opus.
-The problems from the main SWE Bench dataset
-are more difficult and involved edits to
-multiple source files,
-which increased the token costs as compared to Lite.
-Further, aider was benchmarked on 570 SWE Bench problems
-versus only 300 Lite problems,
-adding another factor of ~two to the costs.
+aider alternated between GPT-4o and Opus for up to six total attempts.
+To manage the cost of running the main SWE Bench benchmark,
+aider was limited to two total attempts:
+one with GPT-4o and one with Opus.
 
 For a detailed discussion of the benchmark
-methodology, please see the
+methodology, see the
 [article about aider's SWE Bench Lite results](https://aider.chat/2024/05/22/swe-bench-lite.html).
 Also, the
 [aider SWE Bench repository on GitHub](https://github.com/paul-gauthier/aider-swe-bench)
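The two-attempt scheme in this hunk can be sketched as a simple loop: try GPT-4o first, then Opus, keeping the first plausible solution (one that lints and tests clean). This is a hypothetical illustration of the idea only; `solve`, `run_attempt`, the model strings, and the dict fields are assumptions, not aider's actual API.

```python
# Hypothetical sketch of the two-attempt harness: try GPT-4o, then Opus,
# stopping at the first plausible solution (lints and tests clean).
# All names here are illustrative, not aider's real interface.

def solve(problem, run_attempt, models=("gpt-4o", "claude-3-opus")):
    best = None
    for model in models:
        solution = run_attempt(model, problem)
        best = solution  # keep the latest attempt as a fallback
        if solution["plausible"]:
            break  # a clean-linting, clean-testing solution wins
    return best

# Stub runner standing in for a real model invocation:
def fake_attempt(model, problem):
    return {"model": model, "plausible": model == "claude-3-opus"}

print(solve("issue-123", fake_attempt)["model"])  # claude-3-opus
```

With this shape, adding a third model or raising the attempt cap is just a change to the `models` tuple, which is why the Lite benchmark could alternate for up to six attempts under the same logic.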
@@ -107,7 +99,7 @@ and to use pytest to run tests.
 Aider will pull in the URL's content and then try and resolve the issue.
 - If aider doesn't produce code that lints and tests clean, the user might decide to
 [use git to revert the changes](https://aider.chat/docs/faq.html#how-does-aider-use-git),
-and try again with `aider --opus`. Many aider users employ this strategy.
+and try again with `aider --opus`.
 
 ## Aider with GPT-4o alone was SOTA
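The revert-and-retry step in this hunk is ordinary git usage, since aider commits its edits. A minimal sketch, driven from Python in a throwaway repo purely for illustration (in practice these are plain git commands in a terminal, followed by rerunning aider):

```python
# Sketch of reverting an unwanted aider commit, then retrying.
# The repo, file name, and commit messages are made up for the demo.
import os
import subprocess
import tempfile

def git(*args):
    # Fixed identity so commit/revert work in a bare environment.
    return subprocess.run(
        ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com", *args],
        check=True, capture_output=True, text=True)

repo = tempfile.mkdtemp()
os.chdir(repo)
git("init", "-q")
git("commit", "-q", "--allow-empty", "-m", "init")

# Simulate an aider edit that failed to lint/test clean:
with open("app.py", "w") as f:
    f.write("change that failed lint/tests\n")
git("add", "app.py")
git("commit", "-q", "-m", "aider: first attempt")

# Undo the unwanted commit, as the linked FAQ describes:
git("revert", "--no-edit", "HEAD")
print(os.path.exists("app.py"))  # False: the change is gone

# Then retry the same request with a stronger model, e.g.:
#   aider --opus app.py
```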
@@ -195,9 +187,10 @@ increase the number of resolved problems.
 New solutions may resolve some new problems but they may also
 eclipse and discard some of the previous non-plausible correct solutions.
-Luckily, additional attempts usually provide a net increase in the overall
+Luckily, the net effect of additional attempts
+usually increases or at least maintains the
 number of resolved solutions.
-This was the case for both this main SWE Bench result and the
+This was the case for all the attempts made in both this main SWE Bench result and the
 earlier Lite result.
 
 ## Computing the benchmark score
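The eclipsing effect in the last hunk can be made concrete with a small sketch. Assume, for illustration only, that each attempt yields a solution flagged `plausible` (lints and tests clean) and `correct` (scoring only), and that the harness keeps the first plausible solution, falling back to the final attempt; the exact selection rule and names here are assumptions, not the benchmark's actual code.

```python
# Hypothetical scoring sketch: the first plausible solution is kept,
# otherwise the final attempt is used. A later plausible solution can
# therefore eclipse an earlier lucky non-plausible fix.

def pick_solution(attempts):
    for sol in attempts:
        if sol["plausible"]:
            return sol
    return attempts[-1]

problems = {
    # One attempt: not plausible, but happens to be correct.
    "p1": [{"plausible": False, "correct": True}],
    # Second attempt is plausible and correct, eclipsing the first.
    "p2": [{"plausible": False, "correct": False},
           {"plausible": True, "correct": True}],
}

resolved = sum(pick_solution(a)["correct"] for a in problems.values())
print(resolved)  # 2
```

Here the extra attempt on "p2" adds a resolved problem without disturbing "p1", matching the claim that additional attempts usually increase, or at least maintain, the resolved count.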