This repository provides the code, data, and results for the paper: Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study [arXiv].
The figure above shows how humans play Minesweeper (left) and how GPT-3.5-instruct plays it (right).
This project is built on Python 3.11 and PyQt6.
The complete list of required packages is in the requirements.txt file.
We recommend creating a new conda environment for this project, since installing PyQt can be tricky and may interfere with your existing dependencies.
conda create -n ms python=3.11
conda activate ms
pip install -r requirements.txt
We provide our experiment results and some ablation studies in the ./output folder.
If you are interested in reproducing our results or extending the experiments to a broader set of Minesweeper boards, you can use the Python scripts in the ./tasks/ folder.
Specifically,
- ms.py runs GPT models on the Minesweeper game. The arguments of this script are defined in ./src/args.py. Examples of running this script are provided in ./scripts, where ./5x5.table.sh tests GPTs on $5\times5$ boards with table representation and in natural conversation mode (please refer to the paper for the description of these terms), and ./5x5.coord-ch.sh tests on boards with coordinate representation and in compact history mode. To run these shell scripts, you can use
  ./scripts/5x5.table.sh
- bn.py implements the "board navigation" task defined in the paper.
- nc.py implements the "neighbor counting" task defined in the paper.
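As a rough illustration of what a neighbor-counting query involves on a grid, the sketch below counts the mines adjacent to a given cell. It is not taken from nc.py: the board encoding (a list of strings with `*` marking a mine and `.` an empty cell) and the function name `count_neighbor_mines` are assumptions made for this example only. Please refer to the paper for the actual task definition.

```python
from typing import List


def count_neighbor_mines(board: List[str], row: int, col: int) -> int:
    """Count mines ('*') in the up-to-eight cells surrounding (row, col)."""
    n_rows, n_cols = len(board), len(board[0])
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the queried cell itself
            r, c = row + dr, col + dc
            # Only count neighbors that lie inside the board.
            if 0 <= r < n_rows and 0 <= c < n_cols and board[r][c] == "*":
                count += 1
    return count


# Hypothetical 5x5 board used only for illustration.
EXAMPLE_BOARD = [
    "*....",
    ".....",
    "..*..",
    "...*.",
    ".....",
]
```

Corner and edge cells have fewer than eight neighbors, which is why the bounds check sits inside the loop rather than being assumed away.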
Note that we use corporate GPT APIs, which differ slightly from the general user APIs.
If you are using the same kind of API as ours, you can directly fill in the blanks in the ./resources/*.json files and start running.
If not, you may also need to modify the src.gpt.GPT.response function to suit your needs.
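Before launching a run, it can help to confirm that no configuration field was left blank. The snippet below is a minimal sketch of such a check, not part of this repository: it assumes each JSON file holds a flat object and that an unfilled field is an empty string or null. The helper name `find_blank_fields` is hypothetical, and the directory to scan is passed in by the caller.

```python
import json
from pathlib import Path
from typing import Dict, List


def find_blank_fields(config_dir: str) -> Dict[str, List[str]]:
    """Map each JSON file name in config_dir to its keys with blank values."""
    blanks: Dict[str, List[str]] = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            config = json.load(f)
        if not isinstance(config, dict):
            continue  # assumption: only flat JSON objects are checked
        empty = [key for key, value in config.items() if value is None or value == ""]
        if empty:
            blanks[path.name] = empty
    return blanks
```

An empty result means every top-level field has a value; nested objects are not inspected in this sketch.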
Currently, we have not implemented a script for just playing Minesweeper with a GUI.
However, you can still do so by running the ./assist/label_board.py script.
Specifically,
PYTHONPATH="." python ./assist/label_board.py --data_dir [your data dir] --disable_saving
You can either use our provided data or generate your own Minesweeper boards with ./assist/generate_board.py.
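For readers who want a feel for what board generation involves, here is a minimal sketch of placing mines uniformly at random on a grid. It does not reproduce ./assist/generate_board.py: the function name `generate_board`, its parameters, and the `*` / `.` text encoding are all assumptions made for this example, and the script's real options and output format may differ.

```python
import random
from typing import List, Optional


def generate_board(
    n_rows: int, n_cols: int, n_mines: int, seed: Optional[int] = None
) -> List[str]:
    """Return a board as a list of strings: '*' is a mine, '.' is empty."""
    n_cells = n_rows * n_cols
    if not 0 <= n_mines <= n_cells:
        raise ValueError("n_mines must be between 0 and the number of cells")
    rng = random.Random(seed)  # a fixed seed makes the board reproducible
    # Sample distinct flat cell indices, then map them back to (row, col).
    mine_cells = set(rng.sample(range(n_cells), n_mines))
    return [
        "".join("*" if r * n_cols + c in mine_cells else "." for c in range(n_cols))
        for r in range(n_rows)
    ]


if __name__ == "__main__":
    # Print one hypothetical 5x5 board with 4 mines.
    print("\n".join(generate_board(5, 5, 4, seed=0)))
```

Sampling flat indices without replacement guarantees exactly `n_mines` distinct mine positions, which a per-cell coin flip would not.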
If you find our work helpful, please consider citing it as
@article{Li.2023.Minesweeper,
author = {Yinghao Li and
Haorui Wang and
Chao Zhang},
title = {Assessing Logical Puzzle Solving in Large Language Models: Insights
from a Minesweeper Case Study},
journal = {CoRR},
volume = {abs/2311.07387},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2311.07387},
doi = {10.48550/ARXIV.2311.07387},
eprinttype = {arXiv},
eprint = {2311.07387},
timestamp = {Wed, 15 Nov 2023 16:23:10 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2311-07387.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}