Skip to content

Commit

Permalink
Merge pull request #158 from trishullab/george
Browse files Browse the repository at this point in the history
George
  • Loading branch information
GeorgeTsoukalas authored Jul 16, 2024
2 parents 2767831 + d2b31ae commit 1369c15
Show file tree
Hide file tree
Showing 65 changed files with 132 additions and 3,923 deletions.
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ PutnamBench is a benchmark for evaluation of theorem-proving algorithms on compe

PutnamBench aims to support research in automated mathematical reasoning by providing a multilingual benchmark for evaluating theorem-proving algorithms. It is released under permissive licenses (Apache 2.0 for Lean 4 and Isabelle, MIT for Coq). The informal statements are also available with permission from the MAA.

We encourage community contributions through pull requests, though we kindly ask not to include proofs of any problems to reduce testing contamination. (TODO: After initial release, as of 7/8 this is expected to be in about 1 week.)
We strongly encourage community feedback! Please let us know if you have any comments for improving PutnamBench. If you notice any mistakes, please raise an issue on the repository and we will address it. We kindly ask that you do not write formal proofs for any of the problems in an effort to reduce contamination.

## Citation
The associated paper for PutnamBench is {TODO}. Please consider including the following citation if you find PutnamBench useful.
Expand Down Expand Up @@ -34,5 +34,5 @@ We also report the number of problems in a certain category. Note that some prob


## Versioning
- Version: `v-1`
- Not yet officially released. Please refrain from opening issues or making PRs until the initial release.
- Version: `v0`
- In preliminary release to allow for initial community feedback. We seek to release an official first version following several weeks of discussion with the community.
144 changes: 129 additions & 15 deletions docs/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -91,38 +91,38 @@
<h1 class="title is-1 publication-title">PutnamBench: A Multilingual Competition-Mathematics Benchmark for Formal Theorem-Proving</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://georgetsoukalas.github.io">George Tsoukalas</a><sup>1</sup>,</span>
<a href="https://georgetsoukalas.github.io">George Tsoukalas</a>,</span>
<span class="author-block">
<a href="https://www.linkedin.com/in/jasper-w-lee">Jasper Lee</a><sup>1</sup>,</span>
<a href="https://www.linkedin.com/in/jasper-w-lee">Jasper Lee</a>,</span>
<span class="author-block">
<a href="https://www.linkedin.com/in/john-edward-jennings/">John Jennings</a><sup>1</sup>,
<a href="https://www.linkedin.com/in/john-edward-jennings/">John Jennings</a>,
</span>
<span class="author-block">
<a href="https://www.linkedin.com/in/jimmyxin31415/">Jimmy Xin</a><sup>1</sup>,
<a href="https://www.linkedin.com/in/jimmyxin31415/">Jimmy Xin</a>,
</span>
<span class="author-block">
<a href="https://www.cs.utexas.edu/~mding01/">Michelle Ding</a><sup>1</sup>,
<a href="https://www.cs.utexas.edu/~mding01/">Michelle Ding</a>,
</span>
<span class="author-block">
<a href="www.linkedin.com/in/michael-henry-jennings">Michael Jennings</a><sup>1</sup>,
<a href="www.linkedin.com/in/michael-henry-jennings">Michael Jennings</a>,
</span>
<span class="author-block">
<a href="https://amit9oct.github.io/aboutme/">Amitayush Thakur</a><sup>1</sup>
<a href="https://amit9oct.github.io/aboutme/">Amitayush Thakur</a>,
</span>
<span class="author-block">
<a href="https://www.cs.utexas.edu/~swarat/">Swarat Chaudhuri</a><sup>1</sup>
<a href="https://www.cs.utexas.edu/~swarat/">Swarat Chaudhuri</a>
</span>
</div>

<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>UT Austin,</span>
<span class="author-block">University of Texas at Austin</span>
</div>

<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://github.com/trishullab/putnambench"
<a href="https://arxiv.org"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
Expand All @@ -131,7 +131,7 @@ <h1 class="title is-1 publication-title">PutnamBench: A Multilingual Competition
</a>
</span>
<span class="link-block">
<a href="https://github.com/trishullab/putnambench"
<a href="https://arxiv.org"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
Expand All @@ -149,6 +149,15 @@ <h1 class="title is-1 publication-title">PutnamBench: A Multilingual Competition
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://github.com/trishullab/putnambench/leaderboard"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Leaderboard</span>
</a>
</span>
</div>

</div>
Expand All @@ -165,14 +174,11 @@ <h1 class="title is-1 publication-title">PutnamBench: A Multilingual Competition
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
TODO: This website is still under construction.
</p>
<p>
We present PutnamBench, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems.
</p>
<p>
PutnamBench consists of 1600+ hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America.
PutnamBench consists of 1697 hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America.
All the theorems, of which there are 640, have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations.
</p>
<p>
Expand All @@ -186,6 +192,114 @@ <h2 class="title is-3">Abstract</h2>
</div>
</section>

<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Formal Theorem-Proving</h2>
<div class="content has-text-justified">
<p>
Formal theorem-proving is the task of proving mathematical theorems specified inside an interactive theorem prover such as Lean, Isabelle, or Coq.
These systems allow one to specify theorems and proofs inside a programming language. The correctness of a proof is <em>automatically checked</em> by the compiler.
</p>
<p>
Informal benchmarks such as MATH and GSM8K measure accuracy by exact match with the ground truth numerical answer. However, the reasoning path might be incorrect and <em>cannot be checked automatically</em> as the problem is specified in natural language.
By contrast, benchmarking formal theorem-proving through PutnamBench allows one to measure a model's capacity to produce precise logical arguments in a mathematical context. Furthermore, since no formal proofs have been written for any of our problems, direct contamination is not possible, whereas more measures must be taken to prevent contamination with informal reasoning benchmarks.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
</div>
</section>


<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">The William-Lowell Putnam Mathematical Competition</h2>
<div class="content has-text-justified">
<p>
The William Lowell Putnam Mathematical Competition, organized by the Mathematical Association of America (MAA), is the premier collegiate mathematics competition in North America. Thousands of undergraduate students from universities across the United States and Canada take the exam each year. The competition comprises two 3-hour-long sessions of six problems each, presented in approximately ascending order of difficulty within each session.
</p>
<p>
The Putnam Competition is known for its challenging problems that require a deep understanding of undergraduate-level mathematics and creative problem-solving ability. The competition includes problems involving concepts from a wide variety of topics in the undergraduate curriculum, including analysis, abstract algebra, and linear algebra.
</p>
<p>
The competition is scored out of 120 points, with the median score typically being zero. The top five individual scorers are named Putnam Fellows, and in recent years regularly include former International Mathematical Olympiad (IMO) medalists. Hence, we find that PutnamBench is a well-suited benchmark for measuring progress towards the <a href="https://aimoprize.com/">Artificial Intelligence Mathematical Olympiad</a> and the <a href="https://imo-grand-challenge.github.io/">IMO Grand Challenge</a>."
</p>
</div>
</div>
</div>
</div>
</section>

<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">PutnamBench: Multilinguality & Factored Solutions</h2>
<div class="content has-text-justified">
<p>
<b>Multilinguality.</b> Most recent work on neural theorem-proving for competition mathematics has been focused in Lean and Isabelle. PutnamBench supports researchers using these tools by incorporating formalizations of all 640 problems in these languages. Previously without a benchmark for competition math, we also include formalizations in Coq, of which there are 417. Formalization obstacles specific to Coq, as mentioned in our paper, prevent us from formalizing all 640 problems. We actively support community feedback and contributions as we continue to iterate on PutnamBench-Coq.
</p>
<p>
We hope that the multilinguality of PutnamBench will allow the theorem-proving community to make progress towards general mathematical reasoning machines.
</p>
<p><b>Factored Solutions.</b> Roughly 60% of Putnam problems, in their natural language form, require exhibiting a (closed-form) solution along with a proof of its correctness. Such problems do not assert propositions, and hence are not immediately formalizable as they are not directly the statement of a theorem. Prior benchmarks have included the explicit solution in the theorem statement, but this can drastically reduce the problem's overall difficulty, as finding the answer can be the most challenging part. For such problems, we factor out the solution from the theorem statement, allowing two tasks: proving the theorem with the solution provided, and additionally determining the solution.</p>
</div>
</div>
</div>
</div>
</section>



<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Evaluations & Leaderboard</h2>
<div class="content has-text-justified">
<p>
We perform evaluations with some recent models, as described in the <a href="arxiv.org">TODO: PAPER</a>.
We find that PutnamBench is <b>hard</b>, with an aggregration of all models tried unable to successfully prove even 1% of theorems.
In particular, few-shot GPT-4o solves <b>only a single problem</b> in each language when allotted 10 attempts.
PutnamBench highlights several deficiencies of large language models (LLMs): weak ability to (1) perform precise logical reasoning, (2) operate in low-resource languages.
</p>
<p>
We are actively maintaining a <a href="arxiv.org">TODO: Leaderboard</a> to keep track of progress from upcoming approaches for neural-theorem proving. We will be incorporating results which are accompanied by a research publication (or arxiv preprint) detailing the general methodology.
</p>
</div>
</div>
</div>
</div>
</section>

<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Licensing & Maintenance</h2>
<div class="content has-text-justified">
<p>
PutnamBench is available under an Apache 2.0 License for Isabelle & Lean, and an MIT License for Coq. With a big thanks to the MAA, we also include the informal problem statements as sourced from the official archives.
</p>
<p>
PutnamBench will be actively maintained for the foreseeable future. We plan to incorporate problem formalizations from Putnam 2024 in December. <b>We are actively encouraging community feedback and discussion.</b> We are interested in hearing how you might plan to use PutnamBench! Your feedback may help us iterate on the structure of PutnamBench, such as incorporating a validation/training split. Please reach out to us if you have (1) suggestions to improve the benchmark, (2) comments on maintaining benchmark longevity and usefulness, (3) noticed issues in certain formalizations which require modification.
</p>
</div>
</div>
</div>
</div>
</section>


<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
Expand Down
Loading

0 comments on commit 1369c15

Please sign in to comment.