Paul Gauthier 2024-05-23 07:45:54 -07:00
parent d9594815b0
commit 15c228097b
4 changed files with 128 additions and 111 deletions


@ -13,7 +13,7 @@ on the
achieving a state-of-the-art result.
The current top leaderboard entry is 20.3%
from Amazon Q Developer Agent.
The best result reported elsewhere online seems to be
The best result reported elsewhere seems to be
[22.3% from AutoCodeRover](https://github.com/nus-apr/auto-code-rover).
[![SWE Bench Lite results](/assets/swe_bench_lite.svg)](https://aider.chat/assets/swe_bench_lite.svg)
@ -94,26 +94,29 @@ that used aider with both GPT-4o & Opus.
The benchmark harness alternated between running aider with GPT-4o and Opus.
The harness proceeded in a fixed order, always starting with GPT-4o and
then alternating with Opus until a plausible solution was found.
then alternating with Opus until a plausible solution was found for each
problem.
The table below breaks down the 79 solutions that were ultimately
verified as correctly resolving their issue.
Some noteworthy observations:
- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resulted problems.
- Just the first attempt of Aider with GPT-4o resolved 20.3% of the problems, which ties the Amazon Q Developer Agent currently atop the official leaderboard.
- Aider with GPT-4o on the first attempt immediately found 69% of all plausible solutions which accounted for 77% of the correctly resolved problems.
- ~75% of all plausible and ~90% of all resolved solutions were found after one attempt from aider with GPT-4o and Opus.
- A long tail of solutions continued to be found by both models including one resolved solution on the final, sixth attempt of that problem.
- A long tail of solutions continued to be found by both models including one correctly resolved solution on the final, sixth attempt of that problem.
| Attempt | Agent |Number<br>plausible<br>solutions|Percent of<br>plausible<br>solutions| Number<br/>correctly<br>resolved | Percent<br>of correctly<br>resolved |
|:--------:|------------|---------:|---------:|----:|---:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% |
| **Total** | | **300** | **100%** | **79** | **100%** |
| Attempt | Agent |Number<br>plausible<br>solutions|Percent&nbsp;of<br>plausible<br>solutions| Number<br/>correctly<br>resolved | Percent&nbsp;of<br>correctly<br>resolved | Percent of<br>SWE Bench Lite&nbsp;Resolved |
|:--------:|------------|---------:|---------:|----:|---:|--:|
| 1 | Aider with GPT-4o | 208 | 69.3% | 61 | 77.2% | 20.3% |
| 2 | Aider with Opus | 49 | 16.3% | 10 | 12.7% | 3.3% |
| 3 | Aider with GPT-4o | 20 | 6.7% | 3 | 3.8% | 1.0% |
| 4 | Aider with Opus | 9 | 3.0% | 2 | 2.5% | 0.7% |
| 5 | Aider with GPT-4o | 11 | 3.7% | 2 | 2.5% | 0.7% |
| 6 | Aider with Opus | 3 | 1.0% | 1 | 1.3% | 0.3% |
| **Total** | | **300** | **100%** | **79** | **100%** | **26.3%** |
If we break down correct solutions purely by model,
we can see that aider with GPT-4o outperforms Opus.
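
The percentage columns in the table above can be re-derived from the two count columns. The following is a minimal sketch of that arithmetic, not the script used to build the table. The names `ROWS`, `BENCHMARK_SIZE`, `pct`, `summarize` and `resolved_by_model` are hypothetical, and it assumes the "Percent of SWE Bench Lite Resolved" column divides by the 300 problems in SWE Bench Lite.

```python
# Counts copied from the table: (attempt, agent, plausible, correctly resolved).
ROWS = [
    (1, "Aider with GPT-4o", 208, 61),
    (2, "Aider with Opus", 49, 10),
    (3, "Aider with GPT-4o", 20, 3),
    (4, "Aider with Opus", 9, 2),
    (5, "Aider with GPT-4o", 11, 2),
    (6, "Aider with Opus", 3, 1),
]

# Assumption: the benchmark-wide percentage uses all 300 SWE Bench Lite problems.
BENCHMARK_SIZE = 300


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize(rows, benchmark_size=BENCHMARK_SIZE):
    """Recompute the three percentage columns and the overall resolve rate."""
    total_plausible = sum(row[2] for row in rows)
    total_resolved = sum(row[3] for row in rows)
    table = []
    for attempt, agent, plausible, resolved in rows:
        table.append(
            {
                "attempt": attempt,
                "agent": agent,
                "pct_of_plausible": pct(plausible, total_plausible),
                "pct_of_resolved": pct(resolved, total_resolved),
                "pct_of_benchmark": pct(resolved, benchmark_size),
            }
        )
    return table, pct(total_resolved, benchmark_size)


def resolved_by_model(rows):
    """Sum correctly resolved problems per agent across all attempts."""
    totals = {}
    for _attempt, agent, _plausible, resolved in rows:
        totals[agent] = totals.get(agent, 0) + resolved
    return totals


if __name__ == "__main__":
    table, overall = summarize(ROWS)
    for row in table:
        print(row)
    print("overall resolved (%):", overall)
    print("resolved by model:", resolved_by_model(ROWS))
```

Summing the resolved counts per agent gives the per-model breakdown that the last sentence refers to: 66 for aider with GPT-4o and 13 for aider with Opus, out of 79.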

Binary file not shown.



@ -6,7 +6,7 @@
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cc:Work>
<dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
<dc:date>2024-05-22T20:23:36.416838</dc:date>
<dc:date>2024-05-23T07:38:15.931243</dc:date>
<dc:format>image/svg+xml</dc:format>
<dc:creator>
<cc:Agent>
@ -41,12 +41,12 @@ z
<g id="xtick_1">
<g id="line2d_1">
<defs>
<path id="m1c7d4f1d06" d="M 0 0
<path id="m13d95e4709" d="M 0 0
L 0 3.5
" style="stroke: #000000; stroke-width: 0.8"/>
</defs>
<g>
<use xlink:href="#m1c7d4f1d06" x="130.142981" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="130.142981" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_1">
@ -453,7 +453,7 @@ z
<g id="xtick_2">
<g id="line2d_2">
<g>
<use xlink:href="#m1c7d4f1d06" x="213.207821" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="213.207821" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_2">
@ -479,7 +479,7 @@ z
<g id="xtick_3">
<g id="line2d_3">
<g>
<use xlink:href="#m1c7d4f1d06" x="296.27266" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="296.27266" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_3">
@ -601,7 +601,7 @@ z
<g id="xtick_4">
<g id="line2d_4">
<g>
<use xlink:href="#m1c7d4f1d06" x="379.3375" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="379.3375" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_4">
@ -674,7 +674,7 @@ z
<g id="xtick_5">
<g id="line2d_5">
<g>
<use xlink:href="#m1c7d4f1d06" x="462.40234" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="462.40234" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_5">
@ -886,7 +886,7 @@ z
<g id="xtick_6">
<g id="line2d_6">
<g>
<use xlink:href="#m1c7d4f1d06" x="545.467179" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="545.467179" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_6">
@ -1007,7 +1007,7 @@ z
<g id="xtick_7">
<g id="line2d_7">
<g>
<use xlink:href="#m1c7d4f1d06" x="628.532019" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#m13d95e4709" x="628.532019" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_7">
@ -1043,21 +1043,21 @@ z
<g id="line2d_8">
<path d="M 68.675 273.70025
L 690 273.70025
" clip-path="url(#p4afbc1300d)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
" clip-path="url(#p535a156c8f)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
</g>
<g id="line2d_9">
<defs>
<path id="mb9d6d72965" d="M 0 0
<path id="mb0b2eca59c" d="M 0 0
L -3.5 0
" style="stroke: #000000; stroke-width: 0.8"/>
</defs>
<g>
<use xlink:href="#mb9d6d72965" x="68.675" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#mb0b2eca59c" x="68.675" y="273.70025" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_8">
<!-- 0 -->
<g transform="translate(56.114062 277.286969) scale(0.1 -0.1)">
<g transform="translate(56.114063 277.286969) scale(0.1 -0.1)">
<defs>
<path id="Helvetica-30" d="M 1731 4475
Q 2600 4475 2988 3759
@ -1089,16 +1089,16 @@ z
<g id="line2d_10">
<path d="M 68.675 235.200207
L 690 235.200207
" clip-path="url(#p4afbc1300d)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
" clip-path="url(#p535a156c8f)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
</g>
<g id="line2d_11">
<g>
<use xlink:href="#mb9d6d72965" x="68.675" y="235.200207" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#mb0b2eca59c" x="68.675" y="235.200207" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_9">
<!-- 5 -->
<g transform="translate(56.114062 238.786926) scale(0.1 -0.1)">
<g transform="translate(56.114063 238.786926) scale(0.1 -0.1)">
<defs>
<path id="Helvetica-35" d="M 791 1141
Q 847 659 1238 475
@ -1135,11 +1135,11 @@ z
<g id="line2d_12">
<path d="M 68.675 196.700164
L 690 196.700164
" clip-path="url(#p4afbc1300d)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
" clip-path="url(#p535a156c8f)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
</g>
<g id="line2d_13">
<g>
<use xlink:href="#mb9d6d72965" x="68.675" y="196.700164" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#mb0b2eca59c" x="68.675" y="196.700164" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_10">
@ -1167,11 +1167,11 @@ z
<g id="line2d_14">
<path d="M 68.675 158.200121
L 690 158.200121
" clip-path="url(#p4afbc1300d)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
" clip-path="url(#p535a156c8f)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
</g>
<g id="line2d_15">
<g>
<use xlink:href="#mb9d6d72965" x="68.675" y="158.200121" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#mb0b2eca59c" x="68.675" y="158.200121" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_11">
@ -1186,11 +1186,11 @@ L 690 158.200121
<g id="line2d_16">
<path d="M 68.675 119.700078
L 690 119.700078
" clip-path="url(#p4afbc1300d)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
" clip-path="url(#p535a156c8f)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
</g>
<g id="line2d_17">
<g>
<use xlink:href="#mb9d6d72965" x="68.675" y="119.700078" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#mb0b2eca59c" x="68.675" y="119.700078" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_12">
@ -1232,11 +1232,11 @@ z
<g id="line2d_18">
<path d="M 68.675 81.200034
L 690 81.200034
" clip-path="url(#p4afbc1300d)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
" clip-path="url(#p535a156c8f)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
</g>
<g id="line2d_19">
<g>
<use xlink:href="#mb9d6d72965" x="68.675" y="81.200034" style="stroke: #000000; stroke-width: 0.8"/>
<use xlink:href="#mb0b2eca59c" x="68.675" y="81.200034" style="stroke: #000000; stroke-width: 0.8"/>
</g>
</g>
<g id="text_13">
@ -1248,9 +1248,40 @@ L 690 81.200034
</g>
</g>
<g id="text_14">
<!-- Pass rate (%) -->
<g style="fill: #555555" transform="translate(42.80125 216.562) rotate(-90) scale(0.18 -0.18)">
<!-- Instances resolved (%) -->
<g style="fill: #555555" transform="translate(42.80125 253.582937) rotate(-90) scale(0.18 -0.18)">
<defs>
<path id="Helvetica-49" d="M 628 4591
L 1256 4591
L 1256 0
L 628 0
L 628 4591
z
" transform="scale(0.015625)"/>
<path id="Helvetica-63" d="M 1703 3444
Q 2269 3444 2623 3169
Q 2978 2894 3050 2222
L 2503 2222
Q 2453 2531 2275 2736
Q 2097 2941 1703 2941
Q 1166 2941 934 2416
Q 784 2075 784 1575
Q 784 1072 996 728
Q 1209 384 1666 384
Q 2016 384 2220 598
Q 2425 813 2503 1184
L 3050 1184
Q 2956 519 2581 211
Q 2206 -97 1622 -97
Q 966 -97 575 383
Q 184 863 184 1581
Q 184 2463 612 2953
Q 1041 3444 1703 3444
z
M 1616 3428
L 1616 3428
z
" transform="scale(0.015625)"/>
<path id="Helvetica-28" d="M 1894 4666
Q 1403 3713 1256 3263
Q 1034 2578 1034 1681
@ -1329,19 +1360,28 @@ L 222 -1306
z
" transform="scale(0.015625)"/>
</defs>
<use xlink:href="#Helvetica-50"/>
<use xlink:href="#Helvetica-61" x="66.699219"/>
<use xlink:href="#Helvetica-73" x="122.314453"/>
<use xlink:href="#Helvetica-73" x="172.314453"/>
<use xlink:href="#Helvetica-20" x="222.314453"/>
<use xlink:href="#Helvetica-72" x="250.097656"/>
<use xlink:href="#Helvetica-61" x="283.398438"/>
<use xlink:href="#Helvetica-74" x="339.013672"/>
<use xlink:href="#Helvetica-65" x="366.796875"/>
<use xlink:href="#Helvetica-20" x="422.412109"/>
<use xlink:href="#Helvetica-28" x="450.195312"/>
<use xlink:href="#Helvetica-25" x="483.496094"/>
<use xlink:href="#Helvetica-29" x="572.412109"/>
<use xlink:href="#Helvetica-49"/>
<use xlink:href="#Helvetica-6e" x="27.783203"/>
<use xlink:href="#Helvetica-73" x="83.398438"/>
<use xlink:href="#Helvetica-74" x="133.398438"/>
<use xlink:href="#Helvetica-61" x="161.181641"/>
<use xlink:href="#Helvetica-6e" x="216.796875"/>
<use xlink:href="#Helvetica-63" x="272.412109"/>
<use xlink:href="#Helvetica-65" x="322.412109"/>
<use xlink:href="#Helvetica-73" x="378.027344"/>
<use xlink:href="#Helvetica-20" x="428.027344"/>
<use xlink:href="#Helvetica-72" x="455.810547"/>
<use xlink:href="#Helvetica-65" x="489.111328"/>
<use xlink:href="#Helvetica-73" x="544.726562"/>
<use xlink:href="#Helvetica-6f" x="594.726562"/>
<use xlink:href="#Helvetica-6c" x="650.341797"/>
<use xlink:href="#Helvetica-76" x="672.558594"/>
<use xlink:href="#Helvetica-65" x="722.558594"/>
<use xlink:href="#Helvetica-64" x="778.173828"/>
<use xlink:href="#Helvetica-20" x="833.789062"/>
<use xlink:href="#Helvetica-28" x="861.572266"/>
<use xlink:href="#Helvetica-25" x="894.873047"/>
<use xlink:href="#Helvetica-29" x="983.789062"/>
</g>
</g>
</g>
@ -1368,10 +1408,10 @@ L 690 50.4
<g id="patch_7">
<path d="M 96.917045 273.70025
L 163.368917 273.70025
L 163.368917 70.420022
L 96.917045 70.420022
L 163.368917 71.190023
L 96.917045 71.190023
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3e6a8; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3e6a8; opacity: 0.75"/>
</g>
<g id="patch_8">
<path d="M 179.981885 273.70025
@ -1379,7 +1419,7 @@ L 246.433757 273.70025
L 246.433757 81.200034
L 179.981885 81.200034
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3e6a8; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3e6a8; opacity: 0.75"/>
</g>
<g id="patch_9">
<path d="M 263.046725 273.70025
@ -1387,7 +1427,7 @@ L 329.498596 273.70025
L 329.498596 101.990058
L 263.046725 101.990058
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3d1e6; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3d1e6; opacity: 0.75"/>
</g>
<g id="patch_10">
<path d="M 346.111564 273.70025
@ -1395,7 +1435,7 @@ L 412.563436 273.70025
L 412.563436 112.000069
L 346.111564 112.000069
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3d1e6; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3d1e6; opacity: 0.75"/>
</g>
<g id="patch_11">
<path d="M 429.176404 273.70025
@ -1403,7 +1443,7 @@ L 495.628275 273.70025
L 495.628275 117.390075
L 429.176404 117.390075
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3d1e6; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3d1e6; opacity: 0.75"/>
</g>
<g id="patch_12">
<path d="M 512.241243 273.70025
@ -1411,7 +1451,7 @@ L 578.693115 273.70025
L 578.693115 135.100095
L 512.241243 135.100095
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3d1e6; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3d1e6; opacity: 0.75"/>
</g>
<g id="patch_13">
<path d="M 595.306083 273.70025
@ -1419,11 +1459,11 @@ L 661.757955 273.70025
L 661.757955 183.610149
L 595.306083 183.610149
z
" clip-path="url(#p4afbc1300d)" style="fill: #b3d1e6; opacity: 0.75"/>
" clip-path="url(#p535a156c8f)" style="fill: #b3d1e6; opacity: 0.75"/>
</g>
<g id="text_15">
<!-- 26.4% -->
<g style="fill: #555555" transform="translate(110.295794 92.012848) scale(0.14 -0.14)">
<!-- 26.3% -->
<g style="fill: #555555" transform="translate(110.295794 92.782849) scale(0.14 -0.14)">
<defs>
<path id="Helvetica-36" d="M 1872 4494
Q 2622 4494 2917 4105
@ -1462,28 +1502,6 @@ L 547 0
L 547 681
z
" transform="scale(0.015625)"/>
</defs>
<use xlink:href="#Helvetica-32"/>
<use xlink:href="#Helvetica-36" x="55.615234"/>
<use xlink:href="#Helvetica-2e" x="111.230469"/>
<use xlink:href="#Helvetica-34" x="139.013672"/>
<use xlink:href="#Helvetica-25" x="194.628906"/>
</g>
</g>
<g id="text_16">
<!-- 25.0% -->
<g style="fill: #555555" transform="translate(193.360633 102.79286) scale(0.14 -0.14)">
<use xlink:href="#Helvetica-32"/>
<use xlink:href="#Helvetica-35" x="55.615234"/>
<use xlink:href="#Helvetica-2e" x="111.230469"/>
<use xlink:href="#Helvetica-30" x="139.013672"/>
<use xlink:href="#Helvetica-25" x="194.628906"/>
</g>
</g>
<g id="text_17">
<!-- 22.3% -->
<g style="fill: #555555" transform="translate(276.425473 123.582883) scale(0.14 -0.14)">
<defs>
<path id="Helvetica-33" d="M 1663 -122
Q 869 -122 511 314
Q 153 750 153 1375
@ -1519,6 +1537,26 @@ Q 2438 -122 1663 -122
z
" transform="scale(0.015625)"/>
</defs>
<use xlink:href="#Helvetica-32"/>
<use xlink:href="#Helvetica-36" x="55.615234"/>
<use xlink:href="#Helvetica-2e" x="111.230469"/>
<use xlink:href="#Helvetica-33" x="139.013672"/>
<use xlink:href="#Helvetica-25" x="194.628906"/>
</g>
</g>
<g id="text_16">
<!-- 25.0% -->
<g style="fill: #555555" transform="translate(193.360633 102.79286) scale(0.14 -0.14)">
<use xlink:href="#Helvetica-32"/>
<use xlink:href="#Helvetica-35" x="55.615234"/>
<use xlink:href="#Helvetica-2e" x="111.230469"/>
<use xlink:href="#Helvetica-30" x="139.013672"/>
<use xlink:href="#Helvetica-25" x="194.628906"/>
</g>
</g>
<g id="text_17">
<!-- 22.3% -->
<g style="fill: #555555" transform="translate(276.425473 123.582883) scale(0.14 -0.14)">
<use xlink:href="#Helvetica-32"/>
<use xlink:href="#Helvetica-32" x="55.615234"/>
<use xlink:href="#Helvetica-2e" x="111.230469"/>
@ -1658,30 +1696,6 @@ Q 3319 0 2413 0
L 472 0
L 472 4591
z
" transform="scale(0.015625)"/>
<path id="Helvetica-63" d="M 1703 3444
Q 2269 3444 2623 3169
Q 2978 2894 3050 2222
L 2503 2222
Q 2453 2531 2275 2736
Q 2097 2941 1703 2941
Q 1166 2941 934 2416
Q 784 2075 784 1575
Q 784 1072 996 728
Q 1209 384 1666 384
Q 2016 384 2220 598
Q 2425 813 2503 1184
L 3050 1184
Q 2956 519 2581 211
Q 2206 -97 1622 -97
Q 966 -97 575 383
Q 184 863 184 1581
Q 184 2463 612 2953
Q 1041 3444 1703 3444
z
M 1616 3428
L 1616 3428
z
" transform="scale(0.015625)"/>
<path id="Helvetica-68" d="M 413 4606
L 975 4606
@ -1731,7 +1745,7 @@ z
</g>
</g>
<defs>
<clipPath id="p4afbc1300d">
<clipPath id="p535a156c8f">
<rect x="68.675" y="50.4" width="621.325" height="223.30025"/>
</clipPath>
</defs>



@ -47,7 +47,7 @@ def plot_swe_bench_lite(data_file):
)
# ax.set_xlabel("Models", fontsize=18)
ax.set_ylabel("Pass rate (%)", fontsize=18, color=font_color)
ax.set_ylabel("Instances resolved (%)", fontsize=18, color=font_color)
ax.set_title("SWE Bench Lite", fontsize=20)
ax.set_ylim(0, 29)
plt.xticks(