paul-gauthier committed Jun 1, 2024
1 parent 2febc66 commit 47a3cb8
Showing 4 changed files with 135 additions and 105 deletions.
42 changes: 21 additions & 21 deletions _posts/2024-05-31-both-swe-bench.md
@@ -7,7 +7,7 @@ draft: true

# Aider is SOTA for both SWE Bench and SWE Bench Lite

-Aider scored 18.8%
+Aider scored 18.9%
on the main
[SWE Bench benchmark](https://www.swebench.com),
achieving a state-of-the-art result.
@@ -135,14 +135,14 @@ aider reported no outstanding errors from editing, linting and testing.
- Or, the "most plausible" solution generated by either attempt, with the
[fewest outstanding editing, linting or testing errors](https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution).
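
As a rough illustration of this selection rule, here is a minimal Python sketch. The `Attempt` dataclass, its fields, and the example error counts are assumptions made for this illustration; this is not aider's actual benchmark harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    model: str                # which model produced this solution
    diff: str                 # the proposed source changes
    plausible: bool           # edited, linted and tested with no outstanding errors
    outstanding_errors: int   # count of unresolved editing/linting/testing problems


def pick_solution(attempts: list[Attempt]) -> Optional[Attempt]:
    """Prefer the first plausible attempt; otherwise fall back to the
    attempt with the fewest outstanding editing, linting or testing errors."""
    if not attempts:
        return None
    for attempt in attempts:
        if attempt.plausible:
            return attempt
    return min(attempts, key=lambda a: a.outstanding_errors)


# Example: GPT-4o's first attempt is not plausible, Opus' second attempt is,
# so the Opus solution would be the one submitted.
attempts = [
    Attempt("gpt-4o", "...diff...", plausible=False, outstanding_errors=2),
    Attempt("claude-3-opus", "...diff...", plausible=True, outstanding_errors=0),
]
print(pick_solution(attempts).model)  # claude-3-opus
```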

-The table also provides details on the 107 solutions that were ultimately
+The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.

| Attempt | Agent |Number&nbsp;of<br>proposed<br>solutions|Percent&nbsp;of<br>proposed<br>solutions| Number&nbsp;of<br/>correctly<br>resolved<br>solutions | Percent&nbsp;of<br>correctly<br>resolved<br>solutions | Score&nbsp;on<br>SWE&nbsp;Bench |
|:--------:|------------|---------:|---------:|----:|---:|--:|
-| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 81.3% | 15.3% |
-| 2 | Aider with Opus | 151 | 26.5% | 20 | 18.7% | 3.5% |
-| **Total** | | **570** | **100%** | **107** | **100%** | **18.8%** |
+| 1 | Aider with GPT-4o | 419 | 73.5% | 87 | 80.6% | 15.3% |
+| 2 | Aider with Opus | 151 | 26.5% | 21 | 19.4% | 3.7% |
+| **Total** | | **570** | **100%** | **108** | **100%** | **18.9%** |
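
The percentages in the table follow directly from the raw counts. A quick arithmetic check, using only the numbers shown above (the variable names are just for this sketch):

```python
# Counts taken from the table above
proposed = {"gpt-4o": 419, "opus": 151}
resolved = {"gpt-4o": 87, "opus": 21}
benchmarked = 570  # total instances attempted

for model in proposed:
    share_proposed = proposed[model] / sum(proposed.values())  # 73.5% / 26.5%
    share_resolved = resolved[model] / sum(resolved.values())  # 80.6% / 19.4%
    score = resolved[model] / benchmarked                      # 15.3% / 3.7%
    print(f"{model}: {share_proposed:.1%} {share_resolved:.1%} {score:.1%}")

print(f"overall: {sum(resolved.values()) / benchmarked:.1%}")  # 18.9%
```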

## Non-plausible but correct solutions?

@@ -205,19 +205,19 @@ showing whether aider with GPT-4o and with Opus
produced plausible and/or correct solutions.

|Row|Aider<br>w/GPT-4o<br>solution<br>plausible?|Aider<br>w/GPT-4o<br>solution<br>resolved<br>issue?|Aider<br>w/Opus<br>solution<br>plausible?|Aider<br>w/Opus<br>solution<br>resolved<br>issue?|Number of<br>problems<br>with this<br>outcome|Number of<br>problems<br>resolved|
-|:--:|--:|--:|--:|--:|--:|--:|
-| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
-| B | **plausible** | not resolved | n/a | n/a | 181 | 0 |
-| C | non-plausible | **resolved** | **plausible** | **resolved** | 1 | 1 |
-| D | non-plausible | **resolved** | **plausible** | not resolved | 2 | 0 |
-| E | non-plausible | not resolved | **plausible** | **resolved** | 12 | 12 |
-| F | non-plausible | not resolved | **plausible** | not resolved | 53 | 0 |
-| G | non-plausible | **resolved** | non-plausible | **resolved** | 16 | 16 |
-| H | non-plausible | **resolved** | non-plausible | not resolved | 5 | 3 |
-| I | non-plausible | not resolved | non-plausible | **resolved** | 4 | 2 |
-| J | non-plausible | not resolved | non-plausible | not resolved | 216 | 0 |
-| K | non-plausible | not resolved | n/a | n/a | 7 | 0 |
-|Total|||||570|107|
+|:--:|:--:|:--:|:--:|:--:|--:|--:|
+| A | **plausible** | **resolved** | n/a | n/a | 73 | 73 |
+| B | **plausible** | no | n/a | n/a | 181 | 0 |
+| C | no | no | **plausible** | no | 53 | 0 |
+| D | no | no | **plausible** | **resolved** | 12 | 12 |
+| E | no | **resolved** | **plausible** | no | 2 | 0 |
+| F | no | **resolved** | **plausible** | **resolved** | 1 | 1 |
+| G | no | no | no | no | 216 | 0 |
+| H | no | no | no | **resolved** | 4 | 2 |
+| I | no | **resolved** | no | no | 4 | 3 |
+| J | no | **resolved** | no | **resolved** | 17 | 17 |
+| K | no | no | n/a | n/a | 7 | 0 |
+|Total|||||570|108|
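
To make the table easier to audit, here is a small sketch that tallies the problem and resolution counts per row, transcribed from the revised table above. The row letters and groupings come straight from that table; the code itself is only an illustration.

```python
# (problems, resolved) per outcome row, transcribed from the revised table above
outcomes = {
    "A": (73, 73),  "B": (181, 0),   # GPT-4o plausible on the first attempt
    "C": (53, 0),   "D": (12, 12),   # Opus plausible on the second attempt
    "E": (2, 0),    "F": (1, 1),     # Opus plausible; GPT-4o non-plausible but correct
    "G": (216, 0),  "H": (4, 2),     # neither attempt plausible
    "I": (4, 3),    "J": (17, 17),   # neither attempt plausible
    "K": (7, 0),                     # no second attempt recorded
}

print(sum(n for n, _ in outcomes.values()))  # 570 problems benchmarked
print(sum(r for _, r in outcomes.values()))  # 108 resolved overall

# Rows C-F: Opus' plausible solution was adopted over GPT-4o's first attempt.
adopted = ["C", "D", "E", "F"]
print(sum(outcomes[r][0] for r in adopted))  # 68 adopted
print(sum(outcomes[r][1] for r in adopted))  # 13 of them resolved (55 were not)
```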

Rows A-B show the cases where
aider with GPT-4o found a plausible solution during the first attempt.
@@ -233,7 +233,7 @@ So Opus' solutions were adopted and they
went on to be deemed correct for 13 problems
and incorrect for 55.

-Row D is an interesting special case, where GPT-4o found 2
+In that group, Row E is an interesting special case, where GPT-4o found 2
non-plausible but correct solutions.
We can see that Opus overrides
them with plausible-but-incorrect
@@ -271,8 +271,8 @@ and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider's attempts to resolve the problems.
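
For context, the sketch below shows one rough way such held-out acceptance tests can be used purely for scoring: apply the candidate patch to a clean checkout, then run the instance's tests. The function name, the `FAIL_TO_PASS`/`PASS_TO_PASS` fields, and the pytest invocation are assumptions for this illustration, not the exact SWE Bench harness.

```python
import json
import subprocess


def passes_acceptance_tests(repo_dir: str, patch: str, instance: dict) -> bool:
    """Apply a proposed patch, then run the held-out acceptance tests.
    Used only for scoring, never during the model's attempt."""
    # Apply the candidate solution to a clean checkout of the repo.
    subprocess.run(["git", "apply", "-"], input=patch, text=True,
                   cwd=repo_dir, check=True)

    # Tests that must newly pass, plus tests that must keep passing
    # (assumed to be stored as JSON-encoded lists on the instance).
    tests = json.loads(instance["FAIL_TO_PASS"]) + json.loads(instance["PASS_TO_PASS"])

    result = subprocess.run(["python", "-m", "pytest", *tests],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0
```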

-Aider correctly resolved 107 out of 570 SWE Bench instances that were benchmarked,
-or 18.8%.
+Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
+or 18.9%.

## Acknowledgments

Binary file modified assets/swe_bench.jpg
