diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index 774b38dadc9..4e9ffa5df8b 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -7,7 +7,7 @@ draft: true
 
 # Aider is SOTA for both SWE Bench and SWE Bench Lite
 
-Aider scored 18.8%
+Aider scored 18.9%
 on the main
 [SWE Bench benchmark](https://www.swebench.com),
 achieving a state-of-the-art result.
@@ -135,14 +135,14 @@ aider reported no outstanding errors
 from editing, linting and testing.
 - Or, the "most plausible" solution generated by either attempt, with the
 [fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
 
-The table also provides details on the 107 solutions that were ultimately
+The table also provides details on the 108 solutions that were ultimately
 verified as correctly resolving their issue.
 
 | Attempt | Agent |Number of<br>proposed<br>solutions|Percent of<br>proposed<br>solutions| Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite |
 |:--------:|------------|---------:|---------:|----:|---:|--:|
-| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
-| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
-| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
+| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
+| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
+| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
 
 ## Non-plausible but correct solutions?
@@ -205,19 +205,19 @@ showing whether aider with GPT-4o and with Opus produced
 plausible and/or correct solutions.
 
 |Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|--:|--:|--:|--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | not resolved | n/a | n/a | 181 | 0 |
-| C | non-plausible | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| D | non-plausible | **resolved** | **plausible** | not resolved | 2 | 0 |
-| E | non-plausible | not resolved | **plausible** | **resolved** | 12 | 12 |
-| F | non-plausible | not resolved | **plausible** | not resolved | 53 | 0 |
-| G | non-plausible | **resolved** | non-plausible | **resolved** | 16 | 16 |
-| H | non-plausible | **resolved** | non-plausible | not resolved | 5 | 3 |
-| I | non-plausible | not resolved | non-plausible | **resolved** | 4 | 2 |
-| J | non-plausible | not resolved | non-plausible | not resolved | 216 | 0 |
-| K | non-plausible | not resolved | n/a | n/a | 7 | 0 |
-|Total|||||570|107|
+|:--:|:--:|:--:|:--:|:--:|--:|--:|
+| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
+| B | **plausible** | no | n/a | n/a | 181 | 0 |
+| C | no | no | **plausible** | no | 53 | 0 |
+| D | no | no | **plausible** | **resolved** | 12 | 12 |
+| E | no | **resolved** | **plausible** | no | 2 | 0 |
+| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
+| G | no | no | no | no | 216 | 0 |
+| H | no | no | no | **resolved** | 4 | 2 |
+| I | no | **resolved** | no | no | 4 | 3 |
+| J | no | **resolved** | no | **resolved** | 17 | 17 |
+| K | no | no | n/a | n/a | 7 | 0 |
+|Total|||||570|108|
 
 Rows A-B show the cases where aider with GPT-4o
 found a plausible solution during the first attempt.
@@ -233,7 +233,7 @@ So Opus' solutions were adopted
 and they went on to be deemed correct for 13 problems
 and incorrect for 55.
 
-Row D is an interesting special case, where GPT-4o found 2
+In that group, Row E is an interesting special case, where GPT-4o found 2
 non-plausible but correct solutions.
 We can see that Opus
 overrides them with plausible-but-incorrect
@@ -271,8 +271,8 @@ and the benchmark harness,
 and only to compute statistics about the correctly
 resolved instances.
 They were never run, used, or even visible during aider's attempts to resolve the problems.
 
-Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
-or 18.8%.
+Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
+or 18.9%.
 
 ## Acknowledgments
diff --git a/assets/swe_bench.jpg b/assets/swe_bench.jpg
index 85b84b8c9ae..175eb7063fe 100644
Binary files a/assets/swe_bench.jpg and b/assets/swe_bench.jpg differ
diff --git a/assets/swe_bench.svg b/assets/swe_bench.svg
index cb02c77e7eb..cdafbfae7d4 100644
--- a/assets/swe_bench.svg
+++ b/assets/swe_bench.svg
[SVG markup diff omitted: the chart was regenerated (timestamp 2024-06-01T14:47:44 -> 2024-06-01T14:55:22), updating bar geometry and labels; the markup itself is not recoverable here.]
diff --git a/benchmark/swe-bench.txt b/benchmark/swe-bench.txt
index fee177e3214..b3e5674b5c3 100644
--- a/benchmark/swe-bench.txt
+++ b/benchmark/swe-bench.txt
@@ -1,4 +1,4 @@
-18.8% Aider|GPT-4o|& Opus|(570)
+18.9% Aider|GPT-4o|& Opus|(570)
 17.0% Aider|GPT-4o|(570)
 13.9% Devin|(570)
 13.8% Amazon Q|Developer|Agent|(2294)
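
A note for reviewers of this patch: the post describes a two-attempt scheme that keeps the first plausible solution, and otherwise falls back to the "most plausible" candidate, the one with the fewest outstanding editing, linting or testing errors. A minimal sketch of that fallback is below. It is illustrative only, not aider's actual code: the `Candidate` fields and the first-attempt-wins tie-break are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One attempt's proposed solution (hypothetical type, for illustration)."""
    model: str          # e.g. "gpt-4o" or "opus"
    edit_errors: int    # edits that failed to apply cleanly
    lint_errors: int    # lint problems remaining after the edits
    test_errors: int    # test failures remaining after the edits

    def outstanding(self) -> int:
        # Total unresolved problems left behind by this attempt.
        return self.edit_errors + self.lint_errors + self.test_errors


def pick_solution(candidates: list[Candidate]) -> Candidate:
    # A candidate with zero outstanding errors is "plausible"; failing that,
    # keep whichever attempt left the fewest errors behind. min() keeps the
    # earliest attempt on ties (an assumed tie-break, not confirmed by the post).
    return min(candidates, key=Candidate.outstanding)


# Example: GPT-4o's attempt still has a failing test, Opus' is clean.
best = pick_solution([
    Candidate("gpt-4o", edit_errors=0, lint_errors=0, test_errors=1),
    Candidate("opus", edit_errors=0, lint_errors=0, test_errors=0),
])
assert best.model == "opus"
```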
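
Since the patch touches every derived percentage, here is a quick, self-contained check that the updated attempts table is internally consistent. The counts are copied from the table; everything else is recomputed.

```python
# Counts from the updated attempts table; all percentages are derived.
proposed = {"Aider with GPT-4o": 419, "Aider with Opus": 151}
resolved = {"Aider with GPT-4o": 87, "Aider with Opus": 21}
benchmarked = 570

total_resolved = sum(resolved.values())
assert sum(proposed.values()) == benchmarked   # 419 + 151 = 570
assert total_resolved == 108                   # 87 + 21

for agent in proposed:
    share = 100 * resolved[agent] / total_resolved  # share of resolved solutions
    score = 100 * resolved[agent] / benchmarked     # contribution to the score
    print(f"{agent}: {share:.1f}% of resolved, {score:.1f}% score")
# -> Aider with GPT-4o: 80.6% of resolved, 15.3% score
# -> Aider with Opus: 19.4% of resolved, 3.7% score

print(f"Overall: {100 * total_resolved / benchmarked:.1f}%")  # -> 18.9%
```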