
Zanj integration: datasets & training #177

Merged (89 commits, Apr 28, 2023)
Conversation

@mivanit (Member) commented Apr 13, 2023

(this is a mega pr, sorry)

configs

Modifying configs from the command line is now easier!

  • ConfigHolder.get_config_multisource() (see the sketch after this list)
    • takes one of: a config object, a file to read the config from, or a list of preset names (one for each of the sub-configs)
    • also takes a dotlist-dict to modify any parameters of the config
  • GPTDataset().to_fname() is used to generate the filename for saving (and also for finding a matching dataset to load/download). MazeDatasetConfig also implements this in a custom way
  • MazeDatasetConfig now has a maze_ctor_kwargs field, for passing keyword arguments to maze generation (see Constrained depth first search #183)
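
As a rough illustration of the intended interface -- a sketch only, where the import path, the preset names, and the parameter names `cfg_names`/`kwargs_in` are assumptions rather than the confirmed signature:

```python
from maze_transformer.training.config import ConfigHolder  # assumed import path

# build a config from named presets (one per sub-config), then override
# individual fields via a dotlist-dict; all names below are illustrative
cfg: ConfigHolder = ConfigHolder.get_config_multisource(
    cfg_names=["demo-dataset", "tiny-model", "test-v1"],
    kwargs_in={
        "dataset_cfg.n_mazes": 100,   # dotlist key -> nested config field
        "train_cfg.batch_size": 32,
    },
)
```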

maze dataset

You can now get a MazeDataset from just a config -- it will load, download, or generate the dataset on the fly. The mess of dataset storage formats we had before is gone: a MazeDataset contains a list of SolvedMaze objects and returns one when you call __getitem__. We also added filters and fixed some parallelization issues!

  • GPTDataset().from_config() is a new, simplified way of getting a dataset: simply pass a config, and it will attempt to load from a local directory, download, or generate. Any of these steps can be disabled, and kwargs (for things like the number of cores to use) are passed down. (see the sketch after this list)
  • canonical representation of the dataset as a list of SolvedMaze
  • mazes_objs, mazes_tokens, and mazes_array are now cached properties. They still work, but might be slow since they are not parallelized
  • MazeDataset.__getitem__() now returns a SolvedMaze
  • create_dataset() is deprecated but should still work. Remove this?
  • filtering! You can specify filters in the config under the applied_filters field, or call dataset.filter_by.your_filter_func(your_arg=your_val). Both work the same under the hood
  • you can specify in from_config() whether to run in parallel (default: no). This is useful since parallelization has huge overhead for small datasets; tests are now much faster
  • there may have been some issues with parallelization using the same fixed seed across all processes. This was fixed in Constrained depth first search #183, but in a hacky way
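
A sketch of the resulting workflow. The import path, config field names, and filter name are illustrative; the flags for disabling the load/download/generate steps exist per the list above, but their exact names are not shown here:

```python
from maze_transformer.training.mazedataset import MazeDataset, MazeDatasetConfig  # assumed paths

cfg = MazeDatasetConfig(name="demo", grid_n=5, n_mazes=100)  # field names illustrative

# tries local load, then download, then generation; each step can be
# disabled via kwargs, which are also forwarded (e.g. to the worker pool)
dataset: MazeDataset = MazeDataset.from_config(cfg)

solved_maze = dataset[0]  # __getitem__ now returns a SolvedMaze

# equivalent to listing the filter under applied_filters in the config:
filtered = dataset.filter_by.your_filter_func(your_arg="your_val")
```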

training

Models are now saved as ZANJ objects, and the command-line interface is improved.

  • train() now:
    • saves models as ZANJ
    • returns the trained ZanjHookedTransformer
  • train_model():
    • now returns a TrainingResult, which contains the output path and the model (and eventually logging info, perhaps?)
    • for config, the interface is inherited from ConfigHolder.get_config_multisource(), and kwargs are passed as the modification dict (see the sketch after this list)
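
A sketch of how train_model() might now be called. The import path and the TrainingResult attribute names (`output_path`, `model`) follow the description above but are not a confirmed API:

```python
from maze_transformer.training.train_model import TrainingResult, train_model  # assumed path

# config selection mirrors ConfigHolder.get_config_multisource();
# extra kwargs act as the dotlist modification dict (names illustrative)
result: TrainingResult = train_model(
    base_path="runs/demo",
    cfg_names=["demo-dataset", "tiny-model", "test-v1"],
    **{"train_cfg.batch_size": 16},
)

model = result.model        # the trained ZanjHookedTransformer
print(result.output_path)   # where the ZANJ-serialized model was written
```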

remaining todos:

  • add seed to config
  • allow filtering a dataset by path length (or whatever else you want) via dataset.filter_by.some_function(**kwargs)
    • add MazeDataset().custom_maze_filter(), which takes a custom function (operating on mazes) as an argument -- this makes it easier to add new filters in notebooks etc.
    • @register_wrap_dataset_filter wraps a function which takes a dataset and kwargs, and returns a dataset. We might want a different decorator, register_wrap_solved_maze_filter, which just takes a function (m: SolvedMaze, **kwargs) -> bool and wraps it in a regular python filter() (see the sketch after this list)
  • dependencies
  • tests
  • [~] [out of scope for this PR] implement downloading of maze datasets from wandb
  • speed up generation code
    • re-add parallelization in MazeDataset.generate()
    • [~] [out of scope for this PR] generally speed up / vectorize generation code
    • [~] [out of scope for this PR] if possible: JIT-compile with numba?
  • [~] [out of scope for this PR] integrating with new training code from @afspies in experiments repo
  • augmented maze generation
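
The second decorator proposed above might look roughly like this -- a sketch only, where the import path, the MazeDataset constructor arguments, the `mazes`/`solution` attributes, and the registration step are all assumptions:

```python
from functools import wraps
from typing import Callable

from maze_transformer.training.mazedataset import MazeDataset, SolvedMaze  # assumed path


def register_wrap_solved_maze_filter(
    predicate: Callable[..., bool],
) -> Callable[..., MazeDataset]:
    """wrap a per-maze predicate `(m: SolvedMaze, **kwargs) -> bool` into a
    dataset-level filter, so new filters can be written one maze at a time"""

    @wraps(predicate)
    def dataset_filter(dataset: MazeDataset, **kwargs) -> MazeDataset:
        # plain python filter() over the dataset's list of SolvedMaze
        kept = list(filter(lambda m: predicate(m, **kwargs), dataset.mazes))
        return MazeDataset(cfg=dataset.cfg, mazes=kept)  # ctor args illustrative

    # here the wrapped filter would also be registered so it is reachable
    # via dataset.filter_by.<name>, like register_wrap_dataset_filter does
    return dataset_filter


# usage: a one-maze predicate becomes a whole-dataset filter
@register_wrap_solved_maze_filter
def path_longer_than(m: SolvedMaze, min_length: int = 1) -> bool:
    return len(m.solution) > min_length  # `solution` attribute assumed
```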

questions:

  • how should we handle hashes being included in GPTDataset.to_fname()?

    resolved: no strong opinions; we can change this later without too much cost. Including the hash for now (see the sketch after this list).

  • should MazeDataset.__getitem__() give a SolvedMaze, string or tokenized array?

    resolved: use SolvedMaze everywhere
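
For context, a to_fname()-style scheme as resolved above (readable config fields plus a short config hash) could look like this standalone sketch -- not the actual implementation:

```python
import hashlib
import json


def config_fname(name: str, cfg: dict) -> str:
    """readable config fields plus a short, stable hash of the full
    serialized config, so distinct configs get distinct filenames"""
    blob = json.dumps(cfg, sort_keys=True).encode()
    short_hash = int(hashlib.md5(blob).hexdigest(), 16) % 10**5
    return f"{name}-g{cfg['grid_n']}-n{cfg['n_mazes']}-h{short_hash}"


print(config_fname("test", {"grid_n": 3, "n_mazes": 5}))
# -> something like "test-g3-n5-h12345" (hash digits will vary)
```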

Base automatically changed from add-maze-from-ascii to main April 20, 2023 20:29
mivanit and others added 13 commits April 20, 2023 14:37

  • Revert "…getedLatticeMaze": there is a bug, but my fix does not fix it! This reverts commit 88002f6.
  • Return SolvedMazes from dataset.__getitem__
  • Move tokenization into Maze classes
  • Move batch preprocessing into dataloader
  • Lots of tests for datasets
  • Tidy up filters a bit and allow positional args
  • Speed up tests by using a non-parallel dataloader
  • integration-v1 training config renamed to test-v1
  • constraint options for the `gen_dfs` generation algorithm (by @canrager)
  • added `maze_ctor_kwargs` to `MazeDatasetConfig` to allow setting those options (see the sketch below)
  • fixed some issues arising from parallelism + fixed seed (this was hacky)
  • minor things:
    • bumped muutils to 0.3.10
    • we now use `Coord` and `CoordArray` (numpy) in many places, instead of tuples/lists
    • separated `MAZE_DATASET_CONFIGS` out to [maze_transformer/training/maze_dataset_configs.py](https://github.com/AISC-understanding-search/maze-transformer/pull/184/files#diff-ab008b2d4ddb7138116afef18584f657832ec00430af732f195136a63b0debaf)
    • some random junk

Co-authored-by: mivanit <[email protected]>
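
For illustration, setting the new gen_dfs constraint options through maze_ctor_kwargs might look like this. The import paths, the maze_ctor field, and the constraint kwarg name are assumptions; check gen_dfs itself for the real options:

```python
from maze_transformer.training.mazedataset import MazeDataset, MazeDatasetConfig  # assumed paths
from maze_transformer.generation.generators import LatticeMazeGenerators  # assumed path

cfg = MazeDatasetConfig(
    name="demo-constrained",
    grid_n=5,
    n_mazes=10,
    maze_ctor=LatticeMazeGenerators.gen_dfs,
    # forwarded verbatim to gen_dfs; the kwarg below is illustrative
    maze_ctor_kwargs={"max_tree_depth": 20},
)
dataset = MazeDataset.from_config(cfg)
```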
@mivanit mivanit marked this pull request as ready for review April 28, 2023 03:08
@mivanit (Member, Author) commented Apr 28, 2023

@valedan here are the remaining problems which we need to fix before merging. Once tests pass, I think we are good to go!

  • transposing issue in the baseline solver tests: tests/unit/maze_transformer/evaluation/test_baseline_models.py
    • unclear why this only causes issues with the baseline solver
    • basically, the solution the baseline solver produces is valid, but only in the transpose of the maze. This is probably happening somewhere in the tokenization and might be indicative of a larger issue of some kind, but I hope not.
  • RESOLVED: bumping pytest fixed it! 3 errors of fixture "mocker" not found in tests/unit/maze_transformer/training/test_dataset.py

@valedan (Contributor) left a comment

🚀
