diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index f45c355c6..3cddc6e59 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -162,7 +162,7 @@ The SWE Bench acceptance testing just confirms that tests pass or
 fail in the same pattern as the "gold patch"
 developed by a human to solve the problem.
 Some tests may fail during acceptance testing,
-and that's ok as long they failed for the gold
+and that's ok as long as they failed for the gold
 patch too.
 - There may have been pre-existing linting problems in the repo.
 If lingering linting issues affected code paths that are not well tested,
@@ -200,7 +200,7 @@
 This was the case for both this main SWE Bench result and the earlier Lite result.
 
 The table below breaks down the benchmark outcome of each problem,
-show whether aider with GPT-4o and with Opus
+showing whether aider with GPT-4o and with Opus
 produced plausible and/or correct solutions.
 
 |Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|
@@ -304,7 +304,7 @@
 Table 2 of their reports an `ACR-avg` result of 10.59%
 which is an average pass@1 result.
 The results presented here for aider are all pass@1, as
-the [official SWE Bench Lite leaderboard](https://www.swebench.com)
+the [official SWE Bench leaderboard](https://www.swebench.com)
 only accepts pass@1 results.
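
The first hunk above describes the acceptance criterion: a candidate patch is judged by whether tests pass or fail in the same pattern as the gold patch, with failures tolerated only if the gold patch also fails them. A minimal sketch of that comparison is below; it is not the SWE Bench harness, and the `matches_gold_pattern` helper and the dict-of-test-outcomes representation are hypothetical, chosen only to illustrate the rule.

```python
from __future__ import annotations

# Illustrative sketch only (not SWE Bench harness code): a candidate patch
# matches the gold pattern if every test that passed with the gold patch
# also passes with the candidate; failures shared with the gold patch are ok.

def matches_gold_pattern(candidate: dict[str, bool], gold: dict[str, bool]) -> bool:
    """Return True if `candidate` reproduces the gold patch's pass/fail pattern.

    Both arguments map test names to True (passed) or False (failed).
    """
    for test, gold_passed in gold.items():
        if gold_passed and not candidate.get(test, False):
            # A test the gold patch passes must also pass for the candidate.
            return False
    # Tests that failed for the gold patch are allowed to fail here too.
    return True


if __name__ == "__main__":
    gold = {"test_fix": True, "test_flaky": False}        # hypothetical outcomes
    candidate = {"test_fix": True, "test_flaky": False}
    print(matches_gold_pattern(candidate, gold))  # True: same pattern as gold
```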