Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update miscellaneous data #244

Merged
merged 5 commits into from
Nov 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# PutnamBench
# [PutnamBench](https://arxiv.org/abs/2407.11214)

<p align="center">
<a href="https://trishullab.github.io/PutnamBench/leaderboard.html"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-%23ff8811"></a>
Expand Down Expand Up @@ -32,9 +32,9 @@ We also report the number of problems in a certain category. Note that some prob
| Number Theory | 108 |
| Geometry | 69 |
| Linear Algebra | 51 |
| Abstract Algebra | 28 |
| Combinatorics | 29 |
| Probability | 10 |
| Abstract Algebra | 28 |
| Probability | 10 |
| Set Theory | 8 |

## Citation
Expand Down
4 changes: 2 additions & 2 deletions docs/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -176,8 +176,8 @@ <h2 class="title is-3">Abstract</h2>
We present PutnamBench, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems.
</p>
<p>
PutnamBench consists of 1697 hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America.
All the theorems, of which there are 640, have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations.
PutnamBench consists of 1696 hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America.
There are 644 problems formalized in Lean 4, 640 formalized in Isabelle, and 412 formalized in Coq.
</p>
<p>
Proving the theorems requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PutnamBench to evaluate several established neural and symbolic theorem-provers.
Expand Down
4 changes: 2 additions & 2 deletions docs/leaderboard.html
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,7 @@ <h3 class="fw-light text-nowrap">
<!-- <div id="chart" style="width: 100%; height: 600px"></div> -->
<div class="container-fluid d-flex flex-row flex-nowrap">
<div class="container-fluid d-flex flex-column align-items-center">
<label for="lean" style="font-size: 14px;" class="text-success mb-3">Lean (out of 640)</label>
<label for="lean" style="font-size: 14px;" class="text-success mb-3">Lean (out of 644)</label>
<table
id="lean"
class="table table-responsive table-striped table-bordered flex-shrink-1 border border-success border-3"
Expand All @@ -148,7 +148,7 @@ <h3 class="fw-light text-nowrap">
></table>
</div>
<div class="container-fluid d-flex flex-column align-items-center">
<label for="coq" style="font-size: 14px;" class="text-success mb-3">Coq (out of 417)</label>
<label for="coq" style="font-size: 14px;" class="text-success mb-3">Coq (out of 412)</label>
<table
id="coq"
class="table table-responsive table-striped table-bordered flex-shrink-1 border border-danger border-3"
Expand Down
20 changes: 15 additions & 5 deletions docs/results.json
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
},
"size": 0,
"condensed-compute-budget": "pass@10",
"note": "pass@10 w/ T=0.1"
"note": "pass@10 w/ T=0.1; 1971 B1, 1986 B1, 1995 B1, 2001 A1."
},
"GPT-4o": {
"link": "https://openai.com/index/hello-gpt-4o/",
Expand All @@ -19,7 +19,7 @@
},
"size": 0,
"condensed-compute-budget": "pass@10",
"note": "pass@10 w/ T=0.7"
"note": "pass@10 w/ T=0.7; 1988 B1."
},
"COPRA (GPT-4o)": {
"link": "https://arxiv.org/abs/2310.04353",
Expand All @@ -30,7 +30,7 @@
},
"size": 0,
"condensed-compute-budget": "pass@1",
"note": "pass@1 w/ T=0.0"
"note": "pass@1 w/ T=0.0; 1988 B1."
},
"CoqHammer": {
"link": "https://coqhammer.github.io/",
Expand All @@ -50,7 +50,7 @@
},
"size": 0,
"condensed-compute-budget": "pass@1",
"note": "pass@1 w/ t=120s"
"note": "pass@1 w/ t=120s; 1971 B1, 2001 A1, 2012 A2."
},
"ReProver w/ retrieval": {
"link": "https://arxiv.org/abs/2306.15626",
Expand Down Expand Up @@ -100,6 +100,16 @@
},
"size": 7,
"condensed-compute-budget": "pass@596",
"note": "7 hour cumulative online proof search on 256 GPUs"
"note": "7 hour cumulative online proof search on 256 GPUs; 1977 A3, 1986 A1, 1986 A2, 1986 B1, 1988 B1, 1988 B2, 2001 A1. Commit: 8aaccf8"
},
"InternLM2.5-StepProver": {
"link": "https://arxiv.org/pdf/2410.15700",
"open-data": "FULL",
"num-solved": {
"lean-wsolution": 6
},
"size": 8.8,
"condensed-compute-budget": "pass@2x32x600",
"note": "pass @ 2 x 32 states x 600 state expansions proof search; 1977 A3, 1986 A1, 1986 B1, 1988 B1, 1988 B2, 2001 A1. Commit: b377431"
}
}
Loading