DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with over 750 Dafny programs.
- Dataset: The dataset for DafnyBench (with 782 programs) can be found in the `DafnyBench` directory, which contains the `ground_truth` set & the `hints_removed` set (with compiler hints, i.e. annotations, removed).
- Evaluation: Evaluate LLMs on DafnyBench by asking a model to fill in the missing hints in a test file from the `hints_removed` set and checking whether the reconstructed program can be verified by Dafny (see the sketch below). Please refer to the `eval` directory.
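The hints here are Dafny annotations such as loop invariants and `assert` statements; a reconstruction counts as successful only if Dafny discharges every proof obligation in the filled-in file. As a rough illustration of that check (not the actual `fill_hints.py` logic), the sketch below runs Dafny on a reconstructed file and reports whether it verifies. It assumes a Dafny 4-style `dafny verify` subcommand, the `DAFNY_PATH` variable described in the setup steps below, and a hypothetical file path.

```python
import os
import subprocess


def verifies(dafny_file: str, timeout_s: int = 60) -> bool:
    """Return True if Dafny reports the given program as verified."""
    # Use the DAFNY_PATH variable set up below, or fall back to `dafny` on PATH.
    dafny = os.environ.get("DAFNY_PATH", "dafny")
    try:
        result = subprocess.run(
            [dafny, "verify", dafny_file],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat a verifier timeout as a failed attempt.
        return False
    # Dafny exits with status 0 when all proof obligations are discharged.
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical path to an LLM-reconstructed program.
    print(verifies("reconstructed/Clover_abs.dfy"))
```

The steps below set up Dafny and the Python environment needed to run the actual evaluation scripts.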
- Install Dafny on your machine by following this tutorial
- Clone & `cd` into this repository
- Set up environment by running the following lines:
```bash
python -m venv stats
source stats/bin/activate
pip install -r requirements.txt
cd eval
```
- Set up environment variable for the root directory:
```bash
export DAFNYBENCH_ROOT=
```
- Set up environment variable for the path to the Dafny executable on your machine (for example, `/opt/homebrew/bin/Dafny`):
```bash
export DAFNY_PATH=
```
- If you're evaluating an LLM through API access, set up the API key. For example:
```bash
export OPENAI_API_KEY=
```
- You can choose to evaluate an LLM on a single test program, such as:
```bash
python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"
```
or evaluate on the entire dataset:
```bash
export model_to_eval='gpt-4o'
./run_eval.sh
```
The repository is organized as follows:

- `DafnyBench` - A collection of 782 Dafny programs. Each program has a `ground_truth` version that is fully verified with Dafny & a `hints_removed` version that has hints (i.e. annotations) removed
- `eval` - Contains scripts to evaluate LLMs on DafnyBench
- `results`
  - `results_summary` - Dataframes that summarize LLMs' success on every test program (see the sketch after this list)
  - `reconstructed_files` - LLM outputs with hints filled back in
- `analysis` - Contains a notebook for analyzing the results
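As a quick, hedged example of consuming the outputs under `results/results_summary`, the sketch below loads one summary dataframe with pandas and computes an overall verification success rate. The file name and the `verified` column are hypothetical placeholders, not a documented schema; inspect the actual files for the real layout.

```python
import pandas as pd

# Hypothetical file and column names -- inspect results/results_summary
# for the actual schema produced by the evaluation scripts.
summary = pd.read_csv("results/results_summary/gpt-4o_results.csv")

# Assume one row per test program, with a boolean-like "verified" column
# marking whether the reconstructed program passed Dafny verification.
num_verified = int(summary["verified"].sum())
total = len(summary)
print(f"Verified {num_verified} / {total} programs ({num_verified / total:.1%})")
```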