From 8b5451f4abaf11b543e7a24c25488598451cc981 Mon Sep 17 00:00:00 2001
From: Paul Gauthier
Date: Wed, 22 May 2024 20:15:46 -0700
Subject: [PATCH] copy

---
 _posts/2024-05-22-swe-bench-lite.md | 35 +++++++++++++++--------------
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/_posts/2024-05-22-swe-bench-lite.md b/_posts/2024-05-22-swe-bench-lite.md
index fd0bf04b3..a69321ee3 100644
--- a/_posts/2024-05-22-swe-bench-lite.md
+++ b/_posts/2024-05-22-swe-bench-lite.md
@@ -63,8 +63,9 @@ SWE Bench problem was correctly resolved.
 
 The benchmarking process was similar to a user employing aider like this:
 
-- Launching aider in their repo with the something like command below, which
-tells aider to say yes to every suggestion and use pytest to run tests.
+- Launching aider in their repo with the command below, which
+tells aider to automatically proceed with every suggestion
+and use pytest to run tests.
   - `aider --yes --test-cmd pytest`
 - Pasting the text of a GitHub issue into the chat, or adding it via URL with a command in the chat like:
   - `/web https://github.com/django/django/issues/XXX`
@@ -267,7 +268,7 @@ aider like this:
 aider --test-cmd pytest
 ```
 
-The repositories that are used in the SWE Bench problems are large, open
+The repositories that are used in the SWE Bench problems are large open
 source projects with extensive existing test suites.
 A repo's test suite can be run in three ways:
 
@@ -275,30 +276,30 @@
 2. Run tests after aider has modified the repo.
 So the pre-existing test cases are still present, but may have been modified by aider.
 Aider may have also added new tests.
-3. Run the final "acceptance tests" to judge if aider has
-successfully resolved the problem.
-SWE Bench verifies both pre-existing tests and a set of held out acceptance tests
-(from the so called `test_patch`)
-to check that the issue is properly resolved. During this final acceptance testing,
-any aider edits to tests are discarded to ensure a faithful test of whether the
-issue was resolved.
+3. Run the final "acceptance tests" to judge if aider has successfully resolved the problem.
+These tests include the unmodified pre-existing tests and
+a held out set of tests (from the so called `test_patch`).
 
 For the benchmark, aider is configured with a test command that will run the tests
 as described in (2) above.
 So testing will fail if aider has broken any pre-existing tests or if any
 new tests that it created aren't passing.
-When aider runs a test command, it checks for a non-zero exit status.
-In this case,
-aider will automatically
+If any tests fail, aider will automatically
 share the test output with the LLM and ask it to
 try and resolve the test failures.
 
 To be clear, *aider cannot run or even see the "acceptance tests"* from the `test_patch`
-as described in (3).
+described in (3).
 Those tests are only run outside of aider and the benchmark harness,
 to compute the final benchmark score.
-
-
+To do that,
+the SWE Bench support code
+verifies that the pre-existing and held out tests
+pass as expected from a correct solution.
+If so, the issue is marked as resolved.
+For this final acceptance testing,
+any aider edits to tests are discarded to ensure a faithful determination
+of whether the issue was resolved.
 
 ## Finding a plausible solution
 
@@ -320,7 +321,7 @@ as the SWE Bench `model_patch` to be evaluated later with the
 
 If the solution is not plausible, another instance of aider is launched again from scratch on the same problem.
 
-The harness alternates asking GPT-4o and Opus to solve the problem,
+The harness alternates launching aider with GPT-4o and Opus to solve the problem,
 and gives each model three attempts -- for a total of six attempts.
 As soon as a plausible solution is found, it is accepted and the harness
 moves on to the next SWE Bench instance.
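
The final hunk above describes the harness retry policy: alternate between GPT-4o and Opus, give each model up to three from-scratch attempts, and accept the first plausible solution. Below is a minimal sketch of that loop, assuming that shape of the policy; `run_aider_attempt`, `is_plausible`, and the model name strings are hypothetical stand-ins, not the actual benchmark harness code or aider API.

```python
import itertools

# Model names are illustrative placeholders, not exact aider model identifiers.
MODELS = ["gpt-4o", "claude-3-opus"]
ATTEMPTS_PER_MODEL = 3


def run_aider_attempt(instance, model):
    """Hypothetical stand-in: launch a fresh aider run on one SWE Bench
    instance with the given model and return the resulting diff."""
    raise NotImplementedError


def is_plausible(diff, instance):
    """Hypothetical stand-in: judge whether the attempt looks like a
    plausible solution (edits were made, tests pass, etc.)."""
    raise NotImplementedError


def solve_instance(instance):
    # Alternate the two models, three tries each: at most six attempts total.
    schedule = itertools.islice(
        itertools.cycle(MODELS), len(MODELS) * ATTEMPTS_PER_MODEL
    )
    for model in schedule:
        diff = run_aider_attempt(instance, model)  # each attempt starts from scratch
        if is_plausible(diff, instance):
            return diff  # accepted as the model_patch for this instance
    return None  # no plausible solution after six attempts
```

Returning `None` only goes as far as the hunk does; what the harness does when all six attempts fail is not covered by this patch.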