A corpus of code for the Coq proof assistant along with several independently machine-readable representations, derived from verification projects related to the Mathematical Components (MathComp) library. The corpus can be used, e.g., in machine learning and data mining applications. Machine-readable representations are in the form of S-expressions (sexps), and were created using the SerAPI library.
The latest released version of the corpus can be downloaded from GitHub. The archive includes both the Coq source files and corresponding machine-readable representations.
The latest corpus release is based on Coq 8.10.2 and MathComp 1.9.0,
and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC).
For each source file (e.g., theory.v
), there are two corresponding files with lists of
sexps, organized at the Coq sentence level, for tokens (theory.tok.sexp
)
and abstract syntax trees (theory.ast.sexp
). Moreover, three sexp representations
are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and
elaborated term. All machine-readable representations were created using SerAPI 0.7.1,
via the SerAPI-bundled programs sercomp
, sertok
, and sername
.
A research paper describes the corpus in more detail and provides additional statistics. The corpus is divided into three tiers based on how well projects conform to the MathComp conventions.
Project | Revision SHA | No. files | LOC | Tier | License |
---|---|---|---|---|---|
finmap | 27642a8 | 4 | 6451 | 1 | CECILL-B |
fourcolor | 0851d49 | 60 | 37138 | 1 | CECILL-B |
math-comp | 748d716 | 89 | 84713 | 1 | CECILL-B |
odd-order | ca602a4 | 34 | 36125 | 1 | CECILL-B |
analysis | 9e5fe1d | 17 | 11739 | 2 | CECILL-C |
bigenough | 5f79a32 | 1 | 78 | 2 | CECILL-B |
elliptic-curves | 631af89 | 18 | 9596 | 2 | CECILL-B |
grobner | dfa54f9 | 1 | 1330 | 2 | CECILL-B |
infotheo | 6c17242 | 81 | 42295 | 2 | GPL-3.0-only |
multinomials | 691d795 | 5 | 7363 | 2 | CECILL-B |
real-closed | 495a1fa | 10 | 8925 | 2 | CECILL-B |
robot | b341ad1 | 13 | 11130 | 2 | LGPL-3.0-only |
two-square | 1c09aca | 2 | 1721 | 2 | CECILL-B |
bits | 3cd298a | 10 | 4041 | 3 | Apache-2.0 |
comp-dec-pdl | c1f813b | 16 | 4419 | 3 | CECILL-B |
disel | e8aa80c | 20 | 4473 | 3 | BSD-2-Clause |
fcsl-pcm | eef4503 | 12 | 5789 | 3 | Apache-2.0 |
games | 3d3bd31 | 12 | 4953 | 3 | BSD-2-Clause |
monae | 9d0e461 | 18 | 6655 | 3 | GPL-3.0-only |
reglang | da333e0 | 12 | 3033 | 3 | CECILL-B |
toychain | 97bd697 | 14 | 5275 | 3 | BSD-2-Clause |
The structure of the corpus is as follows:
File/directory | Contents |
---|---|
projects.yml |
List of projects, along with their URL, SHA, build command, installation command, etc. |
raw-files |
Project source files and their machine-readable representations. |
lemmas |
Lemmas for all projects in the corpus and their machine-readable statements. |
lemmas-filtered |
Subset of lemmas obeying restrictions on the maximum sizes of elaborated terms. |
definitions |
Definitions extracted from the corpus. |
The corpus was used to train and evaluate the tool Roosterize, which suggests lemma names in Coq code using neural networks.
If you have used the corpus in a research project, please cite the corresponding research paper in any related publication:
@inproceedings{NieETAL20Roosterize,
author = {Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos},
title = {Deep Generation of {Coq} Lemma Names Using Elaborated Terms},
booktitle = {International Joint Conference on Automated Reasoning},
pages = {97--118},
doi = {10.1007/978-3-030-51054-1_6},
year = {2020},
}
The Coq lemma sentence
Lemma mg_eq_proof L1 L2 (N1 : mgClassifier L1) : L1 =i L2 -> nerode L2 N1.
is serialized into the following tokens (simplified):
(Sentence((IDENT Lemma)(IDENT mg_eq_proof)(IDENT L1)(IDENT L2)
(KEYWORD"(")(IDENT N1)(KEYWORD :)(IDENT mgClassifier)
(IDENT L1)(KEYWORD")")(KEYWORD :)(IDENT L1)(KEYWORD =i)(IDENT L2)
(KEYWORD ->)(IDENT nerode)(IDENT L2)(IDENT N1)(KEYWORD .)))
and the corresponding parsed syntax (simplified):
(VernacExpr()(VernacStartTheoremProof Lemma (Id mg_eq_proof)
(((CLocalAssum(Name(Id L1))(CHole()IntroAnonymous()))
(CLocalAssum(Name(Id L2))(CHole()IntroAnonymous()))
(CLocalAssum(Name(Id N1))
(CApp(CRef(Ser_Qualid(DirPath())(Id mgClassifier)))(CRef(Ser_Qualid(DirPath())(Id L1))))))
(CNotation(InConstrEntrySomeLevel"_ -> _")
(CNotation(InConstrEntrySomeLevel"_ =i _")
(CRef(Ser_Qualid(DirPath())(Id L1)))(CRef(Ser_Qualid(DirPath())(Id L2))))
(CApp(CRef(Ser_Qualid(DirPath())(Id nerode)))
(CRef(Ser_Qualid(DirPath())(Id L2)))(CRef(Ser_Qualid(DirPath())(Id N1))))))))
and finally, the corresponding elaborated term (simplified):
(Prod (Name (Id char)) ... (Prod (Name (Id L1)) ...
(Prod (Name (Id L2)) ... (Prod (Name (Id N1)) ...
(Prod Anonymous (App (Ref (DirPath ((Id ssrbool) (Id ssr) (Id Coq))) (Id eq_mem)) ...
(Var (Id L1)) ... (Var (Id L2)))
(App (Ref (DirPath ((Id myhill_nerode) (Id RegLang))) (Id nerode)) ...
(Var (Id L2)) ... (Var (Id N1))))))))
As described in more detail in the license file, all corpus metadata is licensed under the MIT license, while all Coq source files and corresponding S-expression files are licensed under their respective original open-source licenses.