diff --git a/_posts/2024-05-31-both-swe-bench.md b/_posts/2024-05-31-both-swe-bench.md
index 774b38dadc9..4e9ffa5df8b 100644
--- a/_posts/2024-05-31-both-swe-bench.md
+++ b/_posts/2024-05-31-both-swe-bench.md
@@ -7,7 +7,7 @@ draft: true
# Aider is SOTA for both SWE Bench and SWE Bench Lite
-Aider scored 18.8%
+Aider scored 18.9%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
@@ -135,14 +135,14 @@ aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
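A minimal sketch of that selection rule, assuming each attempt is summarized by its outstanding error counts (the `Attempt` fields and `pick_solution` helper are illustrative names, not aider's actual code):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    diff: str          # the proposed solution
    edit_errors: int   # outstanding edit-format errors
    lint_errors: int   # outstanding lint errors
    test_errors: int   # outstanding test failures

    @property
    def outstanding(self) -> int:
        return self.edit_errors + self.lint_errors + self.test_errors

    @property
    def plausible(self) -> bool:
        # A solution is plausible when aider reported no outstanding
        # errors from editing, linting and testing.
        return self.outstanding == 0

def pick_solution(attempts: list[Attempt]) -> Attempt:
    # Prefer the first plausible attempt; otherwise fall back to the
    # "most plausible" one, i.e. the fewest outstanding errors.
    for attempt in attempts:
        if attempt.plausible:
            return attempt
    return min(attempts, key=lambda a: a.outstanding)
```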
-The table also provides details on the 107 solutions that were ultimately
+The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.
| Attempt | Agent |Number of<br>proposed<br>solutions|Percent of<br>proposed<br>solutions| Number of<br>correctly<br>resolved<br>solutions | Percent of<br>correctly<br>resolved<br>solutions | Score on<br>SWE Bench<br>Lite |
|:--------:|------------|---------:|---------:|----:|---:|--:|
-| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
-| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
-| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
+| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
+| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
+| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
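The updated percentages follow directly from the raw counts; a quick sanity check in plain Python, with values copied from the table:

```python
# Counts from the table: correctly resolved solutions per attempt.
resolved = {"gpt-4o": 87, "opus": 21}
total_resolved = sum(resolved.values())   # 108
total_instances = 570

assert round(100 * resolved["gpt-4o"] / total_resolved, 1) == 80.6
assert round(100 * resolved["opus"] / total_resolved, 1) == 19.4
assert round(100 * resolved["gpt-4o"] / total_instances, 1) == 15.3
assert round(100 * resolved["opus"] / total_instances, 1) == 3.7
assert round(100 * total_resolved / total_instances, 1) == 18.9
```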
## Non-plausible but correct solutions?
@@ -205,19 +205,19 @@ showing whether aider with GPT-4o and with Opus
produced plausible and/or correct solutions.
|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|--:|--:|--:|--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | not resolved | n/a | n/a | 181 | 0 |
-| C | non-plausible | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| D | non-plausible | **resolved** | **plausible** | not resolved | 2 | 0 |
-| E | non-plausible | not resolved | **plausible** | **resolved** | 12 | 12 |
-| F | non-plausible | not resolved | **plausible** | not resolved | 53 | 0 |
-| G | non-plausible | **resolved** | non-plausible | **resolved** | 16 | 16 |
-| H | non-plausible | **resolved** | non-plausible | not resolved | 5 | 3 |
-| I | non-plausible | not resolved | non-plausible | **resolved** | 4 | 2 |
-| J | non-plausible | not resolved | non-plausible | not resolved | 216 | 0 |
-| K | non-plausible | not resolved | n/a | n/a | 7 | 0 |
-|Total|||||570|107|
+|:--:|:--:|:--:|:--:|:--:|--:|--:|
+| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
+| B | **plausible** | no | n/a | n/a | 181 | 0 |
+| C | no | no | **plausible** | no | 53 | 0 |
+| D | no | no | **plausible** | **resolved** | 12 | 12 |
+| E | no | **resolved** | **plausible** | no | 2 | 0 |
+| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
+| G | no | no | no | no | 216 | 0 |
+| H | no | no | no | **resolved** | 4 | 2 |
+| I | no | **resolved** | no | no | 4 | 3 |
+| J | no | **resolved** | no | **resolved** | 17 | 17 |
+| K | no | no | n/a | n/a | 7 | 0 |
+|Total|||||570|108|
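The outcome counts can be cross-checked against the new totals; a quick tally with the values copied from the table above:

```python
# Per row: (number of problems with this outcome, number resolved).
rows = {
    "A": (73, 73), "B": (181, 0), "C": (53, 0), "D": (12, 12),
    "E": (2, 0),   "F": (1, 1),   "G": (216, 0), "H": (4, 2),
    "I": (4, 3),   "J": (17, 17), "K": (7, 0),
}
assert sum(problems for problems, _ in rows.values()) == 570
assert sum(resolved for _, resolved in rows.values()) == 108
```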
Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.
@@ -233,7 +233,7 @@ So Opus' solutions were adopted and they
went on to be deemed correct for 13 problems
and incorrect for 55.
-Row D is an interesting special case, where GPT-4o found 2
+In that group, Row E is an interesting special case: GPT-4o found 2
non-plausible but correct solutions.
We can see that Opus overrides
them with plausible-but-incorrect
@@ -271,8 +271,8 @@ and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to resolve the problems.
-Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
-or 18.8%.
+Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
+or 18.9%.
## Acknowledgments
diff --git a/assets/swe_bench.jpg b/assets/swe_bench.jpg
index 85b84b8c9ae..175eb7063fe 100644
Binary files a/assets/swe_bench.jpg and b/assets/swe_bench.jpg differ
diff --git a/assets/swe_bench.svg b/assets/swe_bench.svg
index cb02c77e7eb..cdafbfae7d4 100644
--- a/assets/swe_bench.svg
+++ b/assets/swe_bench.svg
@@ -6,7 +6,7 @@
- 2024-06-01T14:47:44.878771
+ 2024-06-01T14:55:22.797792
image/svg+xml
@@ -41,12 +41,12 @@ z
-
-
+
@@ -412,7 +412,7 @@ z
-
+
@@ -583,7 +583,7 @@ z
-
+
@@ -699,7 +699,7 @@ z
-
+
@@ -894,7 +894,7 @@ z
-
+
@@ -926,7 +926,7 @@ z
-
+
@@ -1157,7 +1157,7 @@ z
-
+
@@ -1339,16 +1339,16 @@ z
+" clip-path="url(#p8c34e9879c)" style="fill: none; stroke: #b0b0b0; stroke-width: 0.2; stroke-linecap: square"/>
-
-
+
@@ -1392,18 +1392,18 @@ z
-
+
-
+
-
+
-
+
-
+
-
+
@@ -1485,18 +1485,18 @@ L 690 242.500879
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
@@ -1576,18 +1576,18 @@ L 690 144.756199
-
+
-
+
-
+
@@ -1597,18 +1597,18 @@ L 690 112.174638
-
+
-
+
-
+
@@ -1777,50 +1777,50 @@ L 690 50.4
+" clip-path="url(#p8c34e9879c)" style="fill: #b3d1e6; opacity: 0.3"/>
+" clip-path="url(#p8c34e9879c)" style="fill: #b3d1e6; opacity: 0.3"/>
+" clip-path="url(#p8c34e9879c)" style="fill: #b3d1e6; opacity: 0.3"/>
+" clip-path="url(#p8c34e9879c)" style="fill: #b3d1e6; opacity: 0.3"/>
+" clip-path="url(#p8c34e9879c)" style="fill: #b3d1e6; opacity: 0.3"/>
+" clip-path="url(#p8c34e9879c)" style="fill: #1a75c2; opacity: 0.9"/>
+" clip-path="url(#p8c34e9879c)" style="fill: #1a75c2; opacity: 0.9"/>
-
+
@@ -1842,7 +1842,7 @@ z
-
+
-
+
@@ -1894,7 +1894,7 @@ z
-
+
-
+
-
+
-
-
+
+
+
-
+
-
+
-
+
@@ -2231,7 +2261,7 @@ z
-
+
@@ -2243,7 +2273,7 @@ z
-
+
@@ -2255,7 +2285,7 @@ z
-
+
@@ -2266,7 +2296,7 @@ z
-
+
@@ -2277,7 +2307,7 @@ z
-
+
@@ -2356,7 +2386,7 @@ z
-
+
diff --git a/benchmark/swe-bench.txt b/benchmark/swe-bench.txt
index fee177e3214..b3e5674b5c3 100644
--- a/benchmark/swe-bench.txt
+++ b/benchmark/swe-bench.txt
@@ -1,4 +1,4 @@
-18.8% Aider|GPT-4o|& Opus|(570)
+18.9% Aider|GPT-4o|& Opus|(570)
17.0% Aider|GPT-4o|(570)
13.9% Devin|(570)
13.8% Amazon Q|Developer|Agent|(2294)
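Each line in this data file appears to follow a simple layout: a percentage score, pipe-separated label segments (presumably rendered as stacked lines on the chart bars), and a parenthesized instance count as the final segment. A minimal parser sketch under that assumed layout (`parse_leaderboard_line` is a hypothetical helper, not part of aider's repo):

```python
import re

def parse_leaderboard_line(line: str) -> tuple[float, list[str], int]:
    """Parse e.g. '18.9% Aider|GPT-4o|& Opus|(570)' into
    (18.9, ['Aider', 'GPT-4o', '& Opus'], 570)."""
    score_text, rest = line.split(" ", 1)
    score = float(score_text.rstrip("%"))    # "18.9%" -> 18.9
    *labels, count_field = rest.split("|")   # last segment holds "(570)"
    count = int(re.search(r"\((\d+)\)", count_field).group(1))
    return score, labels, count
```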